We have an embarrassingly parallel problem - we run a large number of instances of the same program, each with a different data set, simply by submitting the application to a batch queue many times with different parameters each time.
However, with a large number of tasks, not all of them complete. It does not appear to be a queue problem - all of the jobs do start.
The problem seems to be that with a large number of instances running, many tasks finish at roughly the same time, so they all try to write their data to the parallel file system at pretty much the same moment. What then happens is that either the program cannot write to the file system and crashes in some way, or it just sits there waiting on the write and the batch queue system kills the job after it has sat there too long. (From what I have gathered about the problem, most of the tasks that fail to complete, if not all of them, do not leave core files.)
What would be the best way to schedule the disk writes to avoid this problem? I mention that our program is embarrassingly parallel to emphasize that each process is unaware of the others - they cannot talk to each other to coordinate their writes in some way.
Although I have the source code for the program, we would like to solve the problem without having to modify it if at all possible, since we neither maintain nor develop it (plus most of the comments are in Italian).
I had some thoughts about this:
- Each task could write first to the node's local (scratch) disk. We could then run another job that periodically checks which tasks have finished and moves their files from the local disks to the parallel file system (a minimal sketch of this staging approach follows this list).
- Use an MPI wrapper around the program in a master/slave setup, where the master manages a queue of tasks and farms them out to the slaves; the slave wrapper launches the application, catches the failure (can this be done reliably for a file-system timeout in C++ or, possibly, Java?), and sends a message back to the master so the task can be restarted (a rough sketch of this wrapper is the second example after this list).
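For the first idea, here is a minimal sketch of what a per-task staging wrapper might look like. Everything specific in it is an assumption on my part: the application name `myapp`, its command line, the directory layout, and whether `TMPDIR` points at node-local scratch on our cluster are all placeholders.

```python
#!/usr/bin/env python
# Hypothetical staging wrapper for one task: run the unmodified application with
# its working directory on node-local scratch, then copy the results to the
# parallel file system with retries, so a temporarily unresponsive mount does
# not kill the job. "myapp", its arguments and all paths are placeholders.

import os
import shutil
import subprocess
import sys
import time

LOCAL_SCRATCH = os.environ.get("TMPDIR", "/tmp")   # assumed node-local disk
PARALLEL_FS = "/nfs/results"                       # assumed destination
MAX_RETRIES = 10
RETRY_DELAY = 60                                   # seconds between copy attempts

def main():
    task_id = sys.argv[1]                          # e.g. the SGE task index
    workdir = os.path.join(LOCAL_SCRATCH, "task_" + task_id)
    os.makedirs(workdir)

    # Run the application so that it writes its output locally.
    subprocess.check_call(["myapp", "--param", task_id], cwd=workdir)

    # Copy the finished output to the parallel file system, retrying on failure.
    dest = os.path.join(PARALLEL_FS, "task_" + task_id)
    for attempt in range(MAX_RETRIES):
        try:
            shutil.rmtree(dest, ignore_errors=True)   # clear any partial copy
            shutil.copytree(workdir, dest)
            break
        except (OSError, IOError):
            time.sleep(RETRY_DELAY)
    else:
        sys.exit("could not copy results for task " + task_id)

    shutil.rmtree(workdir)                         # clean up local scratch

if __name__ == "__main__":
    main()
```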
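For the second idea, here is a rough sketch of the master/slave wrapper, written with mpi4py purely for brevity (the same structure would carry over to C++ or Java MPI bindings); `myapp`, the timeout value, the parameter list, and the availability of mpi4py on the cluster are all assumptions. Rather than trying to catch a file-system exception, this version relies on a wall-clock timeout to detect a hung write, which sidesteps the question of whether the I/O error is catchable at all.

```python
#!/usr/bin/env python
# Rough master/slave wrapper sketch using mpi4py (assumed to be available).
# Rank 0 hands out task parameters; every other rank runs the unmodified
# application as a child process with a wall-clock timeout, so a write that
# hangs on the file system gets that task killed and re-queued instead of the
# whole batch job dying. "myapp", TIMEOUT and the parameter list are placeholders.

import subprocess
import time

from mpi4py import MPI

TIMEOUT = 3600                        # seconds one task may run before we give up
TAG_WORK, TAG_DONE, TAG_STOP = 1, 2, 3

def run_with_timeout(param):
    """Run one instance of the application; kill it if it exceeds TIMEOUT."""
    proc = subprocess.Popen(["myapp", "--param", str(param)])
    deadline = time.time() + TIMEOUT
    while proc.poll() is None:
        if time.time() > deadline:
            proc.kill()
            return False
        time.sleep(10)
    return proc.returncode == 0

def master(comm):
    params = [str(i) for i in range(1000)]    # placeholder task parameters
    status = MPI.Status()
    workers = comm.Get_size() - 1
    # Prime every slave with one task, then feed them as results come back.
    for rank in range(1, workers + 1):
        comm.send(params.pop(0), dest=rank, tag=TAG_WORK)
    while workers > 0:
        param, ok = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_DONE, status=status)
        src = status.Get_source()
        if not ok:
            params.append(param)              # failed or timed out: re-queue it
        if params:
            comm.send(params.pop(0), dest=src, tag=TAG_WORK)
        else:
            comm.send(None, dest=src, tag=TAG_STOP)
            workers -= 1

def slave(comm):
    status = MPI.Status()
    while True:
        param = comm.recv(source=0, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send((param, run_with_timeout(param)), dest=0, tag=TAG_DONE)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        master(comm)
    else:
        slave(comm)
```

In a real version one would also cap the number of retries per task and log which tasks were restarted, otherwise a task that can never finish would be re-queued forever.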
In the meantime, I need to pester my supervisors for more information about the failure itself - I have never run into it personally, but then I have not yet had to use the program for a very large number of data sets.
In case this is useful: our HPC system runs Solaris with the SGE (Sun GridEngine) batch queue system. The file system is NFSv4, and the storage servers also run Solaris. The HPC nodes and the storage nodes are connected over Fibre Channel.