This question is related to the problem of PBS jobs producing no output when the system is busy, i.e. some of the jobs I submit produce no output when PBS/Torque is busy. It seems this happens when many jobs are submitted one after another, and when I submit jobs this way I often find that some of them produce no output.
Here is some example code.
Suppose I have a python script called "x_analyse.py" that takes a file containing some data as its argument and analyses the data stored in that file:
./x_analyse.py data_1.pkl
Now, suppose I need to: (1) prepare N such data files: data_1.pkl, data_2.pkl, ..., data_N.pkl; (2) have "x_analyse.py" work on each of them and write the results to a file for each; (3) since the analyses of the different data files are independent of each other, use PBS/Torque to run them in parallel to save time. (I believe this is essentially an "embarrassingly parallel" problem.)
I have the following python script to do this:
import os
import sys
import time

N = 100
for k in range(1, N+1):
    datafilename = 'data_%d' % k
    file = open(datafilename + '.pkl', 'wb')
In essence, the script prepares a data set, saves it to a file, writes a PBS submission file for analysing that data set, submits the job, and then moves on to the next data set, and so on.
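For concreteness, here is a minimal sketch of what the full loop body might look like; the data preparation and the contents of the generated .sub file shown here are placeholders rather than my exact code:

import os
import pickle

N = 100
for k in range(1, N+1):
    datafilename = 'data_%d' % k
    # prepare a data set and pickle it to disk (placeholder data)
    data = list(range(k))
    with open(datafilename + '.pkl', 'wb') as f:
        pickle.dump(data, f)
    # write a PBS submission file for this data set
    subfilename = 'analysis_%d.sub' % k
    with open(subfilename, 'w') as f:
        f.write('#!/bin/bash\n')
        f.write('#PBS -N analysis_%d\n' % k)
        f.write('#PBS -j oe\n')
        f.write('cd $PBS_O_WORKDIR\n')
        f.write('./x_analyse.py %s.pkl\n' % datafilename)
    # submit the job, then move on to the next data set
    os.system('qsub ' + subfilename)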
When the script is run, a list of job IDs is printed to standard output as the jobs are submitted. "ls" shows that there are N .sub files and N .pkl data files. "qstat" shows that all the jobs are running with status "R" and are then completed with status "C". However, afterwards "ls" shows that there are fewer than N .out output files and fewer than N result files produced by "x_analyse.py". In effect, some jobs produce no output at all. If I clear everything and re-run the script above, I get the same behaviour, with some jobs (but not necessarily the same ones as last time) producing no output.
It has been suggested that increasing the waiting time between submitting consecutive jobs improves things, e.g. by adding
time.sleep(10.) #or some other waiting time
But I feel this is not very satisfactory, because I have tried 0.1s, 0.5s, 1.0s, 2.0s and 3.0s, none of which really helped. I have been told that a 50s wait seems to work fine, but if I have to submit 100 jobs, the waiting time alone would be about 5000s, which seems awfully long.
I have also tried reducing the number of times "qsub" is called by submitting a job array instead. I would prepare all the data files as before, but have only a single submission file, "analyse_all.sub":
#!/bin/bash
and then submit it using
qsub -t 1-100 analyse_all.sub
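For reference, here is a minimal sketch of how such an array submission file can be written out from the driver script; the directives shown are illustrative rather than my exact file, and the essential point is that each array element selects its own data file through the $PBS_ARRAYID environment variable that Torque sets for job arrays:

import os

# write a single submission file for the whole array
with open('analyse_all.sub', 'w') as f:
    f.write('#!/bin/bash\n')
    f.write('#PBS -N analyse_all\n')
    f.write('#PBS -j oe\n')
    f.write('cd $PBS_O_WORKDIR\n')
    f.write('./x_analyse.py data_${PBS_ARRAYID}.pkl\n')

# one qsub call submits all 100 array elements
os.system('qsub -t 1-100 analyse_all.sub')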
But even so, some of the array tasks still produce no output.
Is this a common problem? Am I doing something wrong? Is there a better way of submitting jobs like these? Can I do anything to improve this?
Thanks in advance for your help.
Edit 1:
I am using Torque version 2.4.7 and Maui version 3.3.
Also, suppose that the job with ID 1184430.mgt1 produces no output and the job with ID 1184431.mgt1 produces output as expected; when I run "tracejob" on them, I get the following:
[batman@gotham tmp]$ tracejob 1184430.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184430.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_1, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Exit_status=135 resources_used.cput=00:00:00 resources_used.mem=15596kb resources_used.vmem=150200kb resources_used.walltime=00:01:35
12/13/2012 13:54:53  S    Post job file processing error
12/13/2012 13:54:53  S    Email 'o' to batman@mgt1 failed: Child process '/usr/lib/sendmail -f adm batman@mgt1' returned 67 (errno 10:No child processes)

[batman@gotham tmp]$ tracejob 1184431.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184431.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_2, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Exit_status=0 resources_used.cput=00:00:16 resources_used.mem=19804kb resources_used.vmem=154364kb resources_used.walltime=00:00:18
Edit 2: For a job that produces no output, "qstat -f" returns the following:
[batman@gotham tmp]$ qstat -f 1184673.mgt1
Job Id: 1184673.mgt1
    Job_Name = analysis_7
    Job_Owner = batman@mgt1
    resources_used.cput = 00:00:16
    resources_used.mem = 17572kb
    resources_used.vmem = 152020kb
    resources_used.walltime = 00:01:36
    job_state = C
    queue = compute
    server = mgt1
    Checkpoint = u
    ctime = Fri Dec 14 14:00:31 2012
    Error_Path = mgt1:/gpfs1/batman/tmp/analysis_7.e1184673
    exec_host = node26/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Dec 14 14:02:07 2012
    Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_7.out
    Priority = 0
    qtime = Fri Dec 14 14:00:31 2012
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 05:00:00
    session_id = 9397
    Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=batman,
        PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/installs/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sbin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
        PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
        PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
    sched_hint = Post job file processing error; job 1184673.mgt1 on host node26/0
        Unknown resource type  REJHOST=node26 MSG=invalid home directory '/gpfs1/batman' specified, errno=116 (Stale NFS file handle)
    etime = Fri Dec 14 14:00:31 2012
    exit_status = 135
    submit_args = analysis_7.sub
    start_time = Fri Dec 14 14:00:31 2012
    Walltime.Remaining = 1790
    start_count = 1
    fault_tolerant = False
    comp_time = Fri Dec 14 14:02:07 2012
compared with a job that does produce output:
[batman@gotham tmp]$ qstat -f 1184687.mgt1
Job Id: 1184687.mgt1
    Job_Name = analysis_1
    Job_Owner = batman@mgt1
    resources_used.cput = 00:00:16
    resources_used.mem = 19652kb
    resources_used.vmem = 162356kb
    resources_used.walltime = 00:02:38
    job_state = C
    queue = compute
    server = mgt1
    Checkpoint = u
    ctime = Fri Dec 14 14:40:46 2012
    Error_Path = mgt1:/gpfs1/batman/tmp/analysis_1.e1184687
    exec_host = ionode2/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Dec 14 14:43:24 2012
    Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_1.out
    Priority = 0
    qtime = Fri Dec 14 14:40:46 2012
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 05:00:00
    session_id = 28039
    Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=batman,
        PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/installs/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sbin:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
        PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
        PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
        PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
    etime = Fri Dec 14 14:40:46 2012
    exit_status = 0
    submit_args = analysis_1.sub
    start_time = Fri Dec 14 14:40:47 2012
    Walltime.Remaining = 1784
    start_count = 1
It appears that the exit status is 0 in one case (the job that produces output) but 135 in the other.
Edit 3:
From "qstat -f" outputs like the ones above, it seems that the problem has something to do with a "Stale NFS file handle" during post-job file processing. By submitting hundreds of test jobs, I have been able to identify a handful of nodes that produce the failing jobs. By ssh-ing into them, I can find the missing PBS output files in /var/spool/torque/spool, where I can also see output files belonging to other users. One strange thing about these problematic nodes is that if they are the only nodes chosen to run the jobs, the jobs run fine on them. The problem only arises when they are mixed with other nodes.
Since I don't know how to fix the "Stale NFS file handle" error in post-job file processing, I avoid these nodes for now by submitting "dummy" jobs to them
echo sleep 60 | qsub -lnodes=badnode1:ppn=2+badnode2:ppn=2
before submitting the real jobs. Now all jobs produce output as expected, and there is no need for a delay between successive submissions.