Working with big data in python and numpy, not enough RAM, how to save partial results to disk?

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as computing the pairwise distances between all points and running clustering on all of them. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data, I run out of RAM. Of course I do: creating the pairwise-distance matrix for 200k+ points takes a lot of memory.

Here's the catch: I would really like to do this on crappy computers with small amounts of RAM.

Is there a way to do this job without the limitation of low RAM? The fact that it will take much more time is not a problem, as long as the time requirements do not go to infinity!

I would like to set my algorithms running, come back an hour or five later, and not find them stuck because they ran out of RAM! I would like to implement this in Python and be able to use the numpy, scipy, sklearn and networkx libraries. I would like to be able to calculate the pairwise distances between all my points, etc.

Is this possible? How would I go about it, and what should I start reading?

Regards // Mesmer

+22
python arrays numpy scipy bigdata
Apr 22 '13 at 14:36
2 answers

Using numpy.memmap you create arrays directly mapped into a file on disk:

    import numpy
    a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+',
                     shape=(200000, 1000))
    # here you will see a 762MB file created in your working directory

You can treat it like a regular array: a += 1000.
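For instance, a minimal sketch (assuming the file created above already exists): in-place updates behave like ordinary numpy operations, and flush() pushes any cached changes to disk:

    import numpy

    a = numpy.memmap('test.mymemmap', dtype='float32', mode='r+',
                     shape=(200000, 1000))
    a += 1000          # ordinary numpy arithmetic, backed by the file
    a[0, :10] *= 2     # slicing works as well
    a.flush()          # make sure pending changes hit the disk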

You can even map more arrays onto the same file and control it from different sources if necessary. But there are some tricky points. To open the full array again, you first need to "close" the previous one using del:

    del a
    b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+',
                     shape=(200000, 1000))

But opening only part of the array makes simultaneous control possible:

    b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+',
                     shape=(2, 1000))
    b[1, 5] = 123456.
    print a[1, 5]
    #123456.0

Great! a was changed together with b, and the changes are already written to disk.

Another important thing worth mentioning is the offset. Suppose you want b to contain not the first 2 rows, but rows 150000 and 150001.

    b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+',
                     shape=(2, 1000), offset=150000*1000*32/8)
    b[1, 2] = 999999.
    print a[150001, 2]
    #999999.0

Now you can access and update any part of the array with simultaneous operations. Note the byte size that goes into the offset calculation: for float64 this example would be 150000*1000*64/8.
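If you would rather not hard-code the byte size, a small variation (my sketch, not from the answer above) derives it from the dtype's itemsize:

    import numpy

    dt = numpy.dtype('float32')
    n_cols = 1000
    row_start = 150000
    # bytes occupied by all rows before the first one we want
    offset = row_start * n_cols * dt.itemsize
    b = numpy.memmap('test.mymemmap', dtype=dt, mode='r+',
                     shape=(2, n_cols), offset=offset)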

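Coming back to the question itself, the same technique covers the pairwise-distance computation. A rough sketch (file names, the chunk size and the use of scipy's cdist are my assumptions, not part of the answer above) that fills the distance matrix block by block:

    import numpy
    from scipy.spatial.distance import cdist

    # X is your 200k x 1000 dataset; it can itself be a memmap if it
    # does not fit in RAM (file name is illustrative)
    X = numpy.memmap('data.mymemmap', dtype='float32', mode='r',
                     shape=(200000, 1000))
    n = X.shape[0]

    # the full float32 distance matrix is n*n*4 bytes (~160 GB for
    # n=200000), so it lives on disk, never in RAM
    d = numpy.memmap('dist.mymemmap', dtype='float32', mode='w+',
                     shape=(n, n))

    chunk = 100  # rows per block; tune to your RAM
    for i in range(0, n, chunk):
        d[i:i+chunk] = cdist(X[i:i+chunk], X)  # one block of rows at a time
    d.flush()

Note that even on disk the full matrix is enormous, so if you only need aggregates (nearest neighbours, cluster assignments), it is usually better to process each block and keep only the reduced result.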

+31
May 19 '13 at 9:38

You can simply increase the virtual memory (swap space) in the OS and use 64-bit Python, provided you have a 64-bit OS.
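As a quick sanity check (my addition, not part of the answer), you can verify that your interpreter really is a 64-bit build:

    import struct
    print(struct.calcsize('P') * 8)  # 64 on a 64-bit Python build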

0
May 1 '13 at 13:33
