Python reading shared memory

I work with a ~8 GB dataset and use scikit-learn to train different ML models on it. The dataset is basically a list of 1D vectors of ints.

How can I make the dataset available to multiple Python processes, or how should I encode it so that I can use the multiprocessing classes? I have read the ctypes documentation and the multiprocessing documentation, but I am still very confused. I only need to make the data readable from each process so that I can train models on it.

Do I need shared multiprocessing variables backed by ctypes?

How can I represent the dataset as ctypes?

+5
2 answers

I assume that you can load the entire dataset into RAM as a numpy array, and that you are working on Linux or macOS. (If you are on Windows or you cannot fit the array into RAM, then you should probably dump the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches can be shared between processes, so it is not a terrible solution.)
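A minimal sketch of that memmap fallback, with an illustrative file name, dtype, and shape (not taken from the question):

    import numpy as np

    # one-time step: dump the dataset to a flat binary file on disk
    data = np.random.randint(0, 100, size=(1_000_000, 100), dtype=np.int32)
    data.tofile("dataset.bin")

    # in each process: map the file instead of loading it into RAM;
    # the OS page cache keeps hot pages in memory and shares them between
    # processes that map the same file
    mapped = np.memmap("dataset.bin", dtype=np.int32, mode="r", shape=(1_000_000, 100))
    print(mapped[0, :5])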

Under those assumptions, if you only need read-only access to the dataset in other processes created with multiprocessing, you can simply build the dataset and then start the other processes. They will have read-only access to the data from the original namespace. They can modify it, but those changes will not be visible to other processes (memory is copy-on-write: the memory manager copies each memory page a process changes into that process's own address space).
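A minimal sketch of that pattern, assuming the fork start method (the default on Linux; on macOS since Python 3.8 you may need multiprocessing.set_start_method('fork')) and with placeholder data and worker code:

    import multiprocessing
    import numpy as np

    # build the dataset before starting any workers (placeholder contents)
    big_data = np.zeros((1000, 100), dtype=np.int64)

    def train_on_slice(start, stop):
        # the child reads big_data from the memory inherited at fork time;
        # nothing is pickled or copied up front (pages are copy-on-write)
        chunk = big_data[start:stop]
        print(start, chunk.sum())

    if __name__ == "__main__":
        workers = [multiprocessing.Process(target=train_on_slice, args=(i * 250, (i + 1) * 250))
                   for i in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()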

If your other processes need to modify the original dataset and make these changes visible to the parent process or other processes, you can use something like this:

    import multiprocessing
    import numpy as np

    # create your big dataset
    big_data = np.zeros((3, 3))

    # create a shared-memory wrapper for big_data's underlying data
    # (it doesn't matter what datatype we use, and 'c' is easiest)
    # I think if lock=True you get a synchronized object, which you don't want.
    # Note: you will need to set up your own method to synchronize access to big_data.
    buf = multiprocessing.Array('c', big_data.data, lock=False)

    # at this point, buf and big_data.data point to the same block of memory
    # (try looking at id(buf[0]) and id(big_data.data[0])), but for some reason
    # changes aren't propagated between them unless you do the following:
    big_data.data = buf

    # now you can update big_data from any process:
    def add_one_direct():
        big_data[:] = big_data + 1

    def add_one(a):
        # People say this won't work, since Process() will pickle the argument.
        # But in my experience Process() seems to pass the argument via shared
        # memory, so it works OK.
        a[:] = a + 1

    print("starting value:")
    print(big_data)

    p = multiprocessing.Process(target=add_one_direct)
    p.start()
    p.join()
    print("after add_one_direct():")
    print(big_data)

    p = multiprocessing.Process(target=add_one, args=(big_data,))
    p.start()
    p.join()
    print("after add_one():")
    print(big_data)
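On newer NumPy versions, assigning to big_data.data is deprecated and may be rejected, so the snippet above may not run as-is. A minimal sketch of the same shared-Array idea that works the other way around, assuming forked processes: allocate the shared buffer first and wrap it with np.frombuffer (the dtype, shape, and function name here are illustrative):

    import multiprocessing
    import numpy as np

    SHAPE = (3, 3)  # illustrative shape
    N = SHAPE[0] * SHAPE[1]

    # allocate the shared memory first (doubles, lock=False: synchronize access yourself)
    shared = multiprocessing.Array('d', N, lock=False)

    # wrap the shared buffer as a numpy array; both names refer to the same memory
    big_data = np.frombuffer(shared, dtype=np.float64).reshape(SHAPE)

    def add_one():
        # with the fork start method, the child inherits 'shared' and can re-wrap it
        a = np.frombuffer(shared, dtype=np.float64).reshape(SHAPE)
        a[:] = a + 1

    if __name__ == "__main__":
        print("starting value:")
        print(big_data)
        p = multiprocessing.Process(target=add_one)
        p.start()
        p.join()
        print("after add_one():")
        print(big_data)  # the parent sees the child's update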
+3

This may be a duplicate of: Share a large, read-only array between multiprocessing processes.

You can convert your dataset from its current representation to a new numpy.memmap object and use it from every process. But it will not be very fast either way: memmap only gives you the abstraction of an in-RAM array, while the data actually lives in a file on disk that is partially cached in RAM. So you should prefer scikit-learn estimators that offer a partial_fit method and train incrementally with them, as sketched below.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
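A rough sketch of that incremental route, reading mini-batches out of a memmap; the estimator, file names, shapes, and label set are illustrative, and it assumes a matching label file exists:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # illustrative: features and labels were dumped to flat binary files earlier
    X = np.memmap("dataset.bin", dtype=np.int32, mode="r", shape=(1_000_000, 100))
    y = np.memmap("labels.bin", dtype=np.int32, mode="r", shape=(1_000_000,))

    clf = SGDClassifier()
    classes = np.array([0, 1])  # illustrative: partial_fit needs the full label set up front
    batch = 10_000

    for start in range(0, X.shape[0], batch):
        stop = start + batch
        clf.partial_fit(X[start:stop], y[start:stop], classes=classes)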

In fact, joblib (which scikit-learn uses for parallelization) automatically dumps your dataset to a memmap so it can be used from different processes (if the array is large enough, of course).
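For illustration, a small sketch of how joblib hands a large array argument to its workers; max_nbytes='1M' is joblib's documented default threshold, and the chunk function is a placeholder:

    import numpy as np
    from joblib import Parallel, delayed

    big_data = np.zeros((100_000, 100))  # placeholder dataset

    def score_chunk(data, start, stop):
        # when the array argument is larger than max_nbytes, joblib dumps it to a
        # temporary file and the worker receives a read-only numpy.memmap instead
        return data[start:stop].sum()

    results = Parallel(n_jobs=2, max_nbytes="1M")(
        delayed(score_chunk)(big_data, i * 25_000, (i + 1) * 25_000) for i in range(4)
    )
    print(results)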

+1
