This question is related to the problem of PBS jobs producing no output when the system is busy, i.e. some of the jobs I submit produce no output when PBS/Torque is busy. It seems this happens when many jobs are submitted one after another, and when I submit jobs this way I often find that some of them produce no output.
Here is some example code.
Suppose I have a python script called "x_analyse.py" that takes a file containing some data as its argument and analyses the data stored in that file:
./x_analyse.py data_1.pkl
Now, suppose I need to: (1) prepare N such data files: data_1.pkl, data_2.pkl, ..., data_N.pkl; (2) have "x_analyse.py" work on each of them and write the results to a file for each; (3) since the analyses of the different data files are independent of each other, use PBS/Torque to run them in parallel to save time. (I believe this is essentially an "embarrassingly parallel" problem.)
I have the following python script to do this:
import os
import sys
import time

N = 100
for k in range(1, N+1):
    datafilename = 'data_%d' % k
    file = open(datafilename + '.pkl', 'wb')
In essence, the script prepares a data set, saves it to a file, writes a PBS submission file for analysing that data set, submits the job, and then moves on to the next data set, and so on.
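For concreteness, here is a minimal sketch of what the full loop body might look like; the data preparation and the contents of the generated .sub file shown here are placeholders rather than my exact code:

import os
import pickle

N = 100
for k in range(1, N+1):
    datafilename = 'data_%d' % k
    # prepare a data set and pickle it to disk (placeholder data)
    data = list(range(k))
    with open(datafilename + '.pkl', 'wb') as f:
        pickle.dump(data, f)
    # write a PBS submission file for this data set
    subfilename = 'analysis_%d.sub' % k
    with open(subfilename, 'w') as f:
        f.write('#!/bin/bash\n')
        f.write('#PBS -N analysis_%d\n' % k)
        f.write('#PBS -j oe\n')
        f.write('cd $PBS_O_WORKDIR\n')
        f.write('./x_analyse.py %s.pkl\n' % datafilename)
    # submit the job, then move on to the next data set
    os.system('qsub ' + subfilename)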
When the script is run, a list of job IDs is printed to standard output as the jobs are submitted. "ls" shows that there are N .sub files and N .pkl data files. "qstat" shows that all the jobs are running with status "R" and are then completed with status "C". However, afterwards "ls" shows that there are fewer than N .out output files and fewer than N result files produced by "x_analyse.py". In effect, some jobs produce no output at all. If I clear everything and re-run the script above, I get the same behaviour, with some jobs (but not necessarily the same ones as last time) producing no output.
It has been suggested that increasing the waiting time between submitting consecutive jobs improves things, e.g. by adding
time.sleep(10.) #or some other waiting time
But I feel this is not very satisfactory, because I have tried 0.1s, 0.5s, 1.0s, 2.0s and 3.0s, none of which really helped. I have been told that a 50s wait seems to work fine, but if I have to submit 100 jobs, the waiting time alone would be about 5000s, which seems awfully long.
I have also tried reducing the number of times "qsub" is called by submitting a job array instead. I would prepare all the data files as before, but have only a single submission file, "analyse_all.sub":
#!/bin/bash
and then submit it using
qsub -t 1-100 analyse_all.sub
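For reference, here is a minimal sketch of how such an array submission file can be written out from the driver script; the directives shown are illustrative rather than my exact file, and the essential point is that each array element selects its own data file through the $PBS_ARRAYID environment variable that Torque sets for job arrays:

import os

# write a single submission file for the whole array
with open('analyse_all.sub', 'w') as f:
    f.write('#!/bin/bash\n')
    f.write('#PBS -N analyse_all\n')
    f.write('#PBS -j oe\n')
    f.write('cd $PBS_O_WORKDIR\n')
    f.write('./x_analyse.py data_${PBS_ARRAYID}.pkl\n')

# one qsub call submits all 100 array elements
os.system('qsub -t 1-100 analyse_all.sub')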
But even so, some of the array tasks still produce no output.
Is this a common problem? Am I doing something wrong? Is there a better way of submitting jobs like these? Can I do anything to improve this?
Thanks in advance for your help.
Edit 1:
I am using Torque version 2.4.7 and Maui version 3.3.
Also, suppose that the job with ID 1184430.mgt1 produces no output and the job with ID 1184431.mgt1 produces output as expected; when I run "tracejob" on them, I get the following:
[batman@gotham tmp]$ tracejob 1184430.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184430.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_1, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Exit_status=135 resources_used.cput=00:00:00 resources_used.mem=15596kb resources_used.vmem=150200kb resources_used.walltime=00:01:35
12/13/2012 13:54:53  S    Post job file processing error
12/13/2012 13:54:53  S    Email 'o' to batman@mgt1 failed: Child process '/usr/lib/sendmail -f adm batman@mgt1' returned 67 (errno 10:No child processes)

[batman@gotham tmp]$ tracejob 1184431.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184431.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_2, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Exit_status=0 resources_used.cput=00:00:16 resources_used.mem=19804kb resources_used.vmem=154364kb resources_used.walltime=00:00:18
Edit 2: For a job that produces no output, "qstat -f" returns the following:
[batman@gotham tmp]$ qstat -f 1184673.mgt1
Job Id: 1184673.mgt1
    Job_Name = analysis_7
    Job_Owner = batman@mgt1
    resources_used.cput = 00:00:16
    resources_used.mem = 17572kb
    resources_used.vmem = 152020kb
    resources_used.walltime = 00:01:36
    job_state = C
    queue = compute
    server = mgt1
    Checkpoint = u
    ctime = Fri Dec 14 14:00:31 2012
    Error_Path = mgt1:/gpfs1/batman/tmp/analysis_7.e1184673
    exec_host = node26/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Dec 14 14:02:07 2012
    Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_7.out
    Priority = 0
    qtime = Fri Dec 14 14:00:31 2012
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 05:00:00
    session_id = 9397
    Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=batman,
        PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/installs/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sbin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
        PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
        PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
    sched_hint = Post job file processing error; job 1184673.mgt1 on host node26/0
        Unknown resource type  REJHOST=node26 MSG=invalid home directory '/gpfs1/batman' specified, errno=116 (Stale NFS file handle)
    etime = Fri Dec 14 14:00:31 2012
    exit_status = 135
    submit_args = analysis_7.sub
    start_time = Fri Dec 14 14:00:31 2012
    Walltime.Remaining = 1790
    start_count = 1
    fault_tolerant = False
    comp_time = Fri Dec 14 14:02:07 2012
compared with a job that does produce output:
[batman@gotham tmp]$ qstat -f 1184687.mgt1
Job Id: 1184687.mgt1
    Job_Name = analysis_1
    Job_Owner = batman@mgt1
    resources_used.cput = 00:00:16
    resources_used.mem = 19652kb
    resources_used.vmem = 162356kb
    resources_used.walltime = 00:02:38
    job_state = C
    queue = compute
    server = mgt1
    Checkpoint = u
    ctime = Fri Dec 14 14:40:46 2012
    Error_Path = mgt1:/gpfs1/batman/tmp/analysis_1.e1184687
    exec_host = ionode2/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Dec 14 14:43:24 2012
    Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_1.out
    Priority = 0
    qtime = Fri Dec 14 14:40:46 2012
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 05:00:00
    session_id = 28039
    Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=batman,
        PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/installs/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sbin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
        PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
        PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
    etime = Fri Dec 14 14:40:46 2012
    exit_status = 0
    submit_args = analysis_1.sub
    start_time = Fri Dec 14 14:40:47 2012
    Walltime.Remaining = 1784
    start_count = 1
It appears that the exit status is 0 in one case (the job that produces output) but 135 in the other.
Edit 3:
From "qstat -f" outputs like the ones above, it seems that the problem has something to do with a "Stale NFS file handle" during post-job file processing. By submitting hundreds of test jobs, I have been able to identify a handful of nodes that produce the failing jobs. By ssh-ing into them, I can find the missing PBS output files in /var/spool/torque/spool, where I can also see output files belonging to other users. One strange thing about these problematic nodes is that if they are the only nodes chosen to run the jobs, the jobs run fine on them. The problem only arises when they are mixed with other nodes.
Since I don't know how to fix the "Stale NFS file handle" error in post-job file processing, I avoid these nodes for now by submitting "dummy" jobs to them
echo sleep 60 | qsub -lnodes=badnode1:ppn=2+badnode2:ppn=2
before submitting the real jobs. Now all jobs produce output as expected, and there is no need for a delay between successive submissions.