Python parallel output concatenation

I am using something like this:

find folder/ | xargs -n1 -P10 ./logger.py > collab 

Inside logger.py I process each file and print reformatted lines, so collab should look like:

 {'filename' : 'file1', 'size' : 1000}
 {'filename' : 'file1', 'size' : 1000}
 {'filename' : 'file1', 'size' : 1000}
 {'filename' : 'file1', 'size' : 1000}

Instead, sometimes the lines become messy:

 {'filename' : 'file1', 'size' : 1000}
 {'file {'filename' : 'file1', 'size' : 1000}
 name' : 'file1', 'size' : 1000}
 {'filename' : 'file1', 'size' : 1000}

How can I prevent / fix this?

+8
python bash xargs
3 answers

In general, it is very hard to guarantee this will never happen without building in cross-process locking. However, you can usually reduce the problem significantly.

The most common cause is I/O buffering, in Python or in libc. For example, it may buffer 16 KB of output and then write the whole block at once. You can reduce this by flushing stdout after each write, but that's awkward. In theory you could pass -u to Python to disable stdout buffering, but that didn't work when I tried it. See Sebastian's answer to Disable output buffering for a more general solution (though there is probably also a way to disable output buffering directly).
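As a sketch of what "flush after each write" means inside logger.py (the function name `emit` is illustrative, not from the question):

```python
import sys

def emit(record):
    """Format one record as a whole line and flush stdout immediately,
    so Python/libc block buffering cannot hold a partial line that later
    interleaves with output from other worker processes."""
    line = repr(record) + "\n"
    sys.stdout.write(line)
    sys.stdout.flush()  # defeat block buffering when stdout is a file or pipe

if __name__ == "__main__":
    emit({'filename': 'file1', 'size': 1000})
```

This reduces the window for interleaving but does not eliminate it, since the write itself may still be split by the kernel.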

The second problem is that writes are not always atomic. In particular, writes to a pipe are only atomic up to a certain size (PIPE_BUF, at least 512 bytes per POSIX; 4096 on Linux); beyond that, nothing is guaranteed. Strictly speaking this applies only to pipes (not to files), but the same general idea holds: smaller writes are more likely to happen atomically. See http://www.opengroup.org/onlinepubs/000095399/functions/write.html .
+2

The hard and technically correct solution would be to implement a mutex around the writes, but that seems suboptimal to me.
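For completeness, a hypothetical sketch of that write mutex (Unix-only, not from the answer): take a POSIX record lock on the shared output file around each write. POSIX locks are per-process, so the concurrent logger.py instances exclude each other even though they inherited the same descriptor.

```python
import fcntl

def locked_write(out, line):
    """Write one line to the shared file while holding an exclusive
    POSIX record lock, so concurrent processes cannot interleave it."""
    fcntl.lockf(out.fileno(), fcntl.LOCK_EX)  # block until we own the lock
    try:
        out.write(line)
        out.flush()  # push the bytes out while still holding the lock
    finally:
        fcntl.lockf(out.fileno(), fcntl.LOCK_UN)
```

Inside logger.py you would call it as `locked_write(sys.stdout, line)` when stdout is redirected to collab.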

And it's no fun anyway. How about capturing the output of each invocation separately (so you get solid chunks of output rather than an interleaved stream), and then combining those chunks?
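One way to realize that idea, sketched as a hypothetical Python driver replacing the xargs pipeline: run the children in parallel, capture each child's complete output, and let a single thread append one whole chunk at a time to the combined file.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import subprocess

def run_one(cmd, name):
    # Run one child per input file and capture its entire output as a chunk.
    return subprocess.run([*cmd, name], capture_output=True, text=True).stdout

def collect(files, out_path, cmd=("./logger.py",), workers=10):
    # Children run in parallel, but only this thread writes out_path,
    # appending each child's whole output in one go -- no interleaving.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        with open(out_path, "w") as out:
            for chunk in pool.map(partial(run_one, cmd), files):
                out.write(chunk)
```

The trade-off is that each child's output is held in memory until that child finishes.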

+1

The problem is that output from jobs run in parallel by xargs gets mixed together. GNU Parallel is designed to solve exactly this: by default it guarantees that output is not mixed. So you can simply do:

 find folder/ | parallel ./logger.py > collab 

This will run one logger.py per CPU core. If you want 10:

 find folder/ | parallel -P10 ./logger.py > collab 

Watch the intro video to learn more about GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ

+1
