What is the best way to avoid overloading the file system with concurrent writes from embarrassingly parallel jobs?

We have a problem that is embarrassingly parallel: we run a large number of instances of the same program, each with a different data set. We do this simply by submitting the application to the batch queue many times, with different parameters each time.

However, with a large number of jobs, not all of them complete. It does not appear to be a problem with the queue itself - all of the jobs start.

The issue seems to be that with a large number of application instances running, many jobs finish at roughly the same time, and therefore they all try to write their data to the parallel file system at pretty much the same time.

The problem then seems to be that either the program cannot write to the file system and crashes somehow, or it just sits there waiting for the write and the batch queue system kills the job after it has sat there too long. (From what I have gathered about the problem, most of the jobs that fail to complete, if not all of them, do not leave core files.)

What is the best way to schedule the disk writes to avoid this problem? I mention that our program is embarrassingly parallel to emphasize that the processes are not aware of each other - they cannot talk to one another to coordinate their writes somehow.

Although I have the source code for the program, we would like to solve the problem without having to modify it if possible, since we neither maintain nor develop it (plus most of the comments are in Italian).

I have had some thoughts on the matter:

  • Each job writes to the node's local (scratch) disk first. We could then run another job that checks from time to time which jobs have completed and moves their files from the local disks to the parallel file system (a rough wrapper sketch follows this list).
  • Wrap the program in an MPI master/slave harness, where the master manages a queue of jobs and farms them out to the slaves; the slave wrapper runs the application, catches the failure (can I reliably trap a file system timeout in C++, or possibly Java?) and sends a message back to the master to re-queue the job.
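For the first idea, something like the following wrapper is what I have in mind - just a rough sketch, not our actual code. The application path, the results directory, and the reliance on SGE's TMPDIR and JOB_ID environment variables are placeholders that would need adjusting to the real setup:

    // stage_out.cpp - sketch of "write to local scratch, copy at the end" (C++17).
    // The application path, output directories and use of SGE's TMPDIR/JOB_ID
    // variables are assumptions, not the real configuration.
    #include <cstdlib>
    #include <filesystem>
    #include <iostream>
    #include <string>

    namespace fs = std::filesystem;

    int main(int argc, char* argv[]) {
        const char* tmp = std::getenv("TMPDIR");   // node-local scratch (SGE usually sets this)
        const char* jid = std::getenv("JOB_ID");   // unique per batch job
        const fs::path local_dir  = fs::path(tmp ? tmp : "/tmp") / "run_output";
        const fs::path shared_dir = fs::path("/parallel/fs/results") / (jid ? jid : "manual");

        fs::create_directories(local_dir);

        // Run the unmodified application with its working directory on local disk,
        // passing through whatever parameters the batch script supplies.
        std::string cmd = "cd " + local_dir.string() + " && /path/to/application";
        for (int i = 1; i < argc; ++i) cmd += std::string(" ") + argv[i];
        if (std::system(cmd.c_str()) != 0) {
            std::cerr << "application exited with an error\n";
            return 1;
        }

        // Touch the parallel file system only once, with a single bulk copy per job.
        fs::create_directories(shared_dir);
        fs::copy(local_dir, shared_dir,
                 fs::copy_options::recursive | fs::copy_options::overwrite_existing);
        return 0;
    }

The batch script would then submit this wrapper instead of the application itself, so the program does not need to be modified.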

In the meantime, I need to pester my supervisors for more information about the failure itself - I have never run into it personally, but then I have not had to use the program with a very large number of data sets (yet).

In case it is useful: our HPC system runs Solaris with the SGE (Sun GridEngine) batch queue system. The file system is NFS4, and the storage servers also run Solaris. The HPC nodes and the storage nodes exchange data over Fibre Channel.

+7
3 answers

Most parallel file systems, especially those at supercomputing centres, are targeted at HPC applications rather than serial-farm workloads. As a result, they are meticulously optimized for bandwidth, not for IOPS (I/O operations per second); that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs each spitting out huge numbers of tiny little files. It is all too easy for users to run something that works fine(ish) on their desktop, naively scale up to hundreds of simultaneous jobs, starve the system of IOPS, and hang their jobs and usually those of others on the same system.

The main thing you can do here is aggregate, aggregate, aggregate. It would help if you could tell us where you are running so we could find out more about the system. But some tried-and-true strategies:

  • If you output many files per job, change your output strategy so that each job writes one file containing all the others. If you have a local ramdisk, you can do something as simple as writing the files there and then dumping them onto the real file system in one go.
  • Write in binary formats, not in ASCII. Big data never goes into ASCII. Binary formats are roughly 10x faster to write, somewhat smaller, and you can write big chunks at a time rather than a few numbers in a loop, which leads to:
  • Big writes are better than small writes. Every I/O operation is something the file system has to do. Make few, big writes rather than looping over tiny ones (a small comparison sketch follows this list).
  • Similarly, do not write in formats that require you to seek around to different parts of the file at different times. Seeks are slow and wasteful.
  • If you are running many jobs on a node, you can use the same ramdisk trick as above (or local disk) to aggregate all of the jobs' outputs and send them out to the parallel file system in one go.
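As a tiny illustration of the "binary, big writes" point, here is a sketch comparing the two write patterns; the file names and the data are made up, and real code would write its own results buffer:

    // write_compare.cpp - "ASCII, one value at a time" vs. "binary, one big write".
    #include <fstream>
    #include <vector>

    int main() {
        std::vector<double> data(1000000, 3.14159);  // stand-in for a job's results

        // Slow pattern: formatted text, one tiny write per value.
        {
            std::ofstream txt("results.txt");
            for (double x : data) txt << x << '\n';
        }

        // Faster pattern: one large binary write for the whole buffer.
        {
            std::ofstream bin("results.bin", std::ios::binary);
            bin.write(reinterpret_cast<const char*>(data.data()),
                      static_cast<std::streamsize>(data.size() * sizeof(double)));
        }
        return 0;
    }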

The above suggestions will be useful for the I / O performance of your code worldwide, and not just for parallel file systems. IO is everywhere everywhere, and the more you can do in memory, the less I / O you perform, the faster it will work. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.

Likewise, having fewer big files rather than many small files will speed up everything from directory listings to backups on your file system; it is a win all around.

+7

It is hard to say without knowing what exactly causes the crash. If you think it is a file system performance problem, you could try a distributed file system: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html

If you want to implement a master/slave system, Hadoop may well be the answer.

But first of all, I would try to find out what causes the failure...

+2

Operating systems do not always behave nicely when they run out of resources; sometimes they simply abort the process that asks for the first unit of a resource the OS cannot provide. Many operating systems have limits on file handle resources (on Windows, I believe it is a few thousand handles, something you might bump into in circumstances like yours), and failing to find a free handle usually means the OS does bad things to the requesting process.

One simple solution, which does require a program change, is to agree that no more than N of your many jobs can write at once. You will need a shared semaphore that all the jobs can see; most operating systems provide facilities for one, often as a named resource (!). Initialize the semaphore to N before any jobs start. Have each writing job acquire a unit from the semaphore when it is about to write, and release that unit when it is done. The amount of code to do this should be a handful of lines inserted once into your highly parallel application. Then you tune N until you no longer have the problem. N == 1 will surely do it, and you can presumably do much better than that. A sketch of such a write gate follows below.
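Here is a minimal sketch of the idea using a POSIX named semaphore. The semaphore name and the value of N are placeholders, and note that a named semaphore only coordinates processes on the same machine; to gate jobs across cluster nodes you would need a cluster-wide equivalent of the same pattern.

    // write_gate.cpp - N-writer gate using a POSIX named semaphore (sketch).
    // On some systems you may need to link with -lrt.
    #include <fcntl.h>      // O_CREAT
    #include <semaphore.h>  // sem_open, sem_wait, sem_post
    #include <cstdio>

    int main() {
        const int N = 4;  // at most N jobs writing to the shared file system at once

        // Every job opens the same named semaphore; the first one creates it
        // with an initial count of N.
        sem_t* gate = sem_open("/fs_write_gate", O_CREAT, 0644, N);
        if (gate == SEM_FAILED) { std::perror("sem_open"); return 1; }

        sem_wait(gate);    // block until one of the N write "slots" is free
        // ... perform this job's writes to the parallel file system here ...
        sem_post(gate);    // give the slot back for the next job

        sem_close(gate);
        return 0;
    }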

+1
