Python: sharing huge dictionaries using multiprocessing

I process very large amounts of data stored in a dictionary using multiprocessing. Basically, all I do is load some signatures stored in a dictionary, build a shared dict object from it (getting the "proxy" object returned by Manager.dict()), and pass this proxy as an argument to the function that has to run under multiprocessing.

Just to clarify:

    signatures = dict()
    load_signatures(signatures)
    [...]
    manager = Manager()
    signaturesProxy = manager.dict(signatures)
    [...]
    result = pool.map(myfunction, [signaturesProxy]*NUM_CORES)

Now, everything works fine as long as the signatures are about 2 million records or fewer. However, I have to process a dictionary with 5.8M keys (dumping the binary signatures to disk produces a 4.8 GB file). In that case, the process dies during the creation of the proxy object:

    Traceback (most recent call last):
      File "matrix.py", line 617, in <module>
        signaturesProxy = manager.dict(signatures)
      File "/usr/lib/python2.6/multiprocessing/managers.py", line 634, in temp
        token, exp = self._create(typeid, *args, **kwds)
      File "/usr/lib/python2.6/multiprocessing/managers.py", line 534, in _create
        id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
      File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
        raise convert_to_error(kind, result)
    multiprocessing.managers.RemoteError:
    ---------------------------------------------------------------------------
    Traceback (most recent call last):
      File "/usr/lib/python2.6/multiprocessing/managers.py", line 173, in handle_request
        request = c.recv()
    EOFError
    ---------------------------------------------------------------------------

I know the data structure is huge, but I'm working on a machine with 32 GB of RAM, and watching top I see that the process, after loading the signatures, uses 7 GB of RAM. It then starts building the proxy object and RAM usage climbs to ~17 GB, but it never gets close to 32. At that point RAM usage starts dropping quickly and the process terminates with the above error. So I assume this is not an out-of-memory error...

Any idea or suggestion?

Thanks,

David

+6
python dictionary multiprocessing shared-objects
4 answers

If the dictionaries are read-only, you don't need proxy objects on most operating systems.

Just load the dictionaries before launching the workers, and put them somewhere they will be reachable; the easiest place is a module-level global. The workers will be able to read them.

    from multiprocessing import Pool

    buf = ""

    def f(x):
        buf.find("x")
        return 0

    if __name__ == '__main__':
        buf = "a" * 1024 * 1024 * 1024
        pool = Pool(processes=1)
        result = pool.apply_async(f, [10])
        print result.get(timeout=5)

In this case only 1 GB of memory is used in total, not 1 GB per process, because any modern OS will share the data created before the fork copy-on-write. Just remember that changes to the data won't be visible to the other workers, and memory will, of course, be allocated for any data you change.

It will still use some memory: the page of each object containing the reference count will be modified, so those pages will be allocated. Whether that matters depends on the data.

This will work on any OS that implements ordinary forking. It won't work on Windows; its (crippled) process model requires relaunching the entire process for each worker, so it's not very good at sharing data.
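For the dictionary in the question, the same approach might look roughly like this (a sketch only: load_signatures and myfunction reuse the question's names, but their bodies here are stand-ins, and NUM_CORES is an assumption):

    from multiprocessing import Pool

    NUM_CORES = 4  # assumption: set to the real core count

    # Loaded once in the parent; forked workers inherit it copy-on-write.
    signatures = {}

    def load_signatures(d):
        # Stand-in for the question's loader, just so the sketch runs.
        for i in range(1000):
            d["key%d" % i] = "signature%d" % i

    def myfunction(key):
        # Read-only lookup against the inherited global dict: no proxy,
        # and the dictionary itself is never pickled.
        return (key, signatures[key])

    if __name__ == '__main__':
        load_signatures(signatures)
        pool = Pool(processes=NUM_CORES)
        results = pool.map(myfunction, list(signatures))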

-3

Why don't you try this with a database? Databases are not limited to addressable/physical RAM and are safe for multi-threaded/multi-process use.
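For example, a rough sketch using the standard-library sqlite3 module (the file, table, and column names here are made up for illustration):

    import sqlite3

    DB_PATH = "signatures.db"  # hypothetical file name

    def build_db(signatures):
        # Dump the in-memory dict to disk once, before starting any workers.
        conn = sqlite3.connect(DB_PATH)
        conn.execute("CREATE TABLE IF NOT EXISTS sigs (key TEXT PRIMARY KEY, value BLOB)")
        conn.executemany("INSERT OR REPLACE INTO sigs VALUES (?, ?)", signatures.items())
        conn.commit()
        conn.close()

    def lookup(key):
        # Each worker process opens its own connection; connections must not
        # be shared across processes.
        conn = sqlite3.connect(DB_PATH)
        row = conn.execute("SELECT value FROM sigs WHERE key = ?", (key,)).fetchone()
        conn.close()
        return row[0] if row else None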

+6

In the interest of saving time and not having to debug system-level problems, maybe you could split your 5.8 million records into three sets of roughly 2 million each and run the job 3 times.
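A rough sketch of that split (run_job is a hypothetical placeholder for the existing processing step):

    from itertools import islice

    def split_dict(big_dict, n_chunks=3):
        # Yield n_chunks smaller dicts of roughly equal size.
        it = iter(big_dict.items())
        chunk_size = len(big_dict) // n_chunks + 1
        while True:
            chunk = dict(islice(it, chunk_size))
            if not chunk:
                break
            yield chunk

    # Hypothetical usage: process each ~2M-key piece as a separate run.
    # for i, part in enumerate(split_dict(signatures, 3)):
    #     run_job(part)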

+2

I think the problem you ran into is that the dict, or hash table, resizes itself as it grows. Initially, the dict has a certain number of buckets available. I'm not sure about Python, but I know Perl starts with 8 and then, when the buckets are full, the hash is re-created with more of them (i.e. 8, 16, 32, ...).

A bucket is a landing spot for the hash algorithm. 8 slots doesn't mean 8 entries; it means 8 memory locations. When a new item is added, a hash is generated from its key, which determines the bucket it is stored in.

Collisions do happen. The more items there are in a bucket, the slower lookups get, because items are appended sequentially due to the dynamic sizing of the slot.

One problem that may arise is that your keys are very similar and produce the same hash result, meaning most of the keys land in the same slot. Pre-allocating the hash buckets would eliminate this and actually improve processing time and key management, and you would no longer need all the swapping that may be going on.

However, I think you are still limited by the amount of free contiguous memory and will ultimately have to move to a database-backed solution.

Side note: I'm still new to Python. I know that in Perl you can see hash statistics by printing %HASHNAME; it shows the distribution of bucket usage, which helps you see the number of collisions and decide whether you need to pre-allocate buckets. Can this be done in Python?
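Not in the same built-in way, as far as I know, but the distribution can be approximated by hand (a rough sketch; the bucket count here is an arbitrary assumption, not the size of CPython's real internal table):

    def bucket_histogram(keys, n_buckets=1024):
        # Count how many keys would land in each slot if the table
        # used hash(key) % n_buckets.
        per_bucket = {}
        for k in keys:
            b = hash(k) % n_buckets
            per_bucket[b] = per_bucket.get(b, 0) + 1
        # Turn that into a histogram: occupancy -> number of buckets.
        histogram = {}
        for count in per_bucket.values():
            histogram[count] = histogram.get(count, 0) + 1
        return histogram

    # A histogram dominated by high-occupancy buckets would suggest
    # poorly distributed keys.
    # print bucket_histogram(["key%d" % i for i in range(1000)])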

Rich

0
