I am processing a very large amount of data stored in a dictionary, using multiprocessing. Basically, all I do is load some signatures stored in a dictionary, build a shared dict object from it (getting the "proxy" object returned by Manager.dict()), and pass this proxy as an argument to the function that has to run in multiprocessing.
To clarify:
signatures = dict()
load_signatures(signatures)
[...]
manager = Manager()
signaturesProxy = manager.dict(signatures)
[...]
result = pool.map(myfunction, [signaturesProxy]*NUM_CORES)
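For reference, here is a fuller, self-contained sketch of what I am doing. The bodies of load_signatures and myfunction are just placeholders for my real code, and NUM_CORES is set to an example value:

from multiprocessing import Manager, Pool

NUM_CORES = 8  # example value; the real script uses the machine's core count

def load_signatures(signatures):
    # placeholder: the real code fills the dict with ~5.8M binary signatures
    signatures['example_key'] = 'example_signature'

def myfunction(signaturesProxy):
    # placeholder: the real worker looks up signatures through the proxy
    return len(signaturesProxy)

if __name__ == '__main__':
    signatures = dict()
    load_signatures(signatures)

    manager = Manager()
    signaturesProxy = manager.dict(signatures)   # this is where it crashes with the big dict

    pool = Pool(NUM_CORES)
    result = pool.map(myfunction, [signaturesProxy] * NUM_CORES)
    pool.close()
    pool.join()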
Now, everything works fine as long as the dictionary has up to around 2 million records. However, I have to process a dictionary with 5.8M keys (dumping the binary signatures to disk produces a 4.8 GB file). In this case, the process dies during the creation of the proxy object:
Traceback (most recent call last):
  File "matrix.py", line 617, in <module>
    signaturesProxy = manager.dict(signatures)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 634, in temp
    token, exp = self._create(typeid, *args, **kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 534, in _create
    id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
I know the data structure is huge, but I'm working on a machine with 32 GB of RAM, and watching top I see that the process, after loading the signatures, uses 7 GB of RAM. It then starts building the proxy object and RAM usage climbs to ~17 GB, but it never gets anywhere near 32. At that point RAM usage starts to drop quickly and the process dies with the error above. So I assume this is not an out-of-memory error...
Any ideas or suggestions?
Thanks,
David
python dictionary multiprocessing shared-objects