Very large matrices using Python and NumPy

NumPy is an extremely useful library, and in using it I've found that it can handle matrices that are quite large (10,000 x 10,000) easily, but begins to struggle with anything much bigger (trying to create a 50,000 x 50,000 matrix fails). Obviously, this is because of the huge memory requirements.

Is there some way to create huge matrices in NumPy (say, 1 million by 1 million) without needing several terabytes of RAM?

+78
python numpy matrix
Jun 28 '09 at 0:32
12 answers

PyTables and NumPy are the way to go.

PyTables will store the data on disk in HDF format, with optional compression. My datasets often get 10x compression, which is handy when dealing with tens or hundreds of millions of rows. It is also very fast; my 5-year-old laptop can crunch through data doing SQL-like GROUP BY aggregation at 1,000,000 rows per second. Not bad for a Python-based solution!

Accessing the data as a NumPy recarray is as simple as:

data = table[row_from:row_to] 

The HDF library takes care of reading in the relevant chunks of data and converting them to NumPy.
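
For concreteness, here is a minimal PyTables sketch of that workflow (the file name, shapes, and compression settings are illustrative assumptions, not taken from the answer):

import numpy as np
import tables

# Create an HDF5 file containing a compressed, chunked array on disk.
with tables.open_file("matrix.h5", mode="w") as h5:
    atom = tables.Float64Atom()
    filters = tables.Filters(complevel=5, complib="blosc")
    carr = h5.create_carray(h5.root, "m", atom, shape=(100_000, 1_000),
                            filters=filters)
    # Fill the array in row blocks so only one block is in RAM at a time.
    for start in range(0, 100_000, 10_000):
        carr[start:start + 10_000, :] = np.random.rand(10_000, 1_000)

# Later: read back only the rows you need, as a plain NumPy array.
with tables.open_file("matrix.h5", mode="r") as h5:
    data = h5.root.m[500:600]   # same idea as table[row_from:row_to]
    print(data.shape)           # (100, 1000)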

+86
Jun 30 '09 at 9:11

numpy.array is designed to live in memory. If you want to work with matrices larger than your RAM, you have to work around that. There are at least two approaches you can follow:

  • Try a more efficient matrix representation that exploits any special structure your matrices have. For example, as others have already pointed out, there are efficient data structures for sparse matrices (matrices with lots of zeros), such as scipy.sparse.csc_matrix (a minimal sketch follows this list).
  • Modify your algorithm to work on submatrices. That way you read from disk only the matrix blocks that are currently being used in a computation. Algorithms designed to run on clusters usually work block-wise, since the data is distributed across different machines and passed around only when needed. For example, Fox's algorithm for matrix multiplication (PDF file).
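
Here is a minimal sketch of the first approach using scipy.sparse (the matrix size and entries below are made up for illustration):

import numpy as np
from scipy import sparse

n = 1_000_000
rows = np.array([0, 10, 999_999])
cols = np.array([5, 10, 999_999])
vals = np.array([1.5, 2.0, 3.0])

# COO is convenient for construction; convert to CSC for fast arithmetic and slicing.
m = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsc()

# A matrix-vector product works without ever materializing the dense 1M x 1M matrix.
v = np.ones(n)
print(m.dot(v)[:3])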
+53
Jun 28 '09 at 2:53

You should be able to use numpy.memmap to memory-map a file on disk. With a recent Python and a 64-bit machine, you should have the necessary address space without loading everything into memory. The OS will keep only part of the file in memory at any time.
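
A minimal sketch of what that might look like (the file name and shape are assumptions; a 50,000 x 50,000 float32 map corresponds to roughly a 10 GB file on disk):

import numpy as np

# Create a disk-backed array; only the pages you touch are loaded into RAM.
m = np.memmap("big_matrix.dat", dtype=np.float32, mode="w+",
              shape=(50_000, 50_000))

m[0, :] = np.arange(50_000)   # writes go to the mapped file
m.flush()                     # make sure changes reach the disk

# Re-open it read-only later, with the same dtype and shape.
m2 = np.memmap("big_matrix.dat", dtype=np.float32, mode="r",
               shape=(50_000, 50_000))
print(m2[0, :5])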

+30
Jun 28 '09 at 1:46

To handle sparse matrices, you need the scipy package, which sits on top of numpy - see here for more details about the sparse matrix options that scipy gives you.
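
As an illustration of those options (the sizes and values here are made up), one common workflow is to build the matrix incrementally with lil_matrix and then convert it to csr_matrix for arithmetic:

import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix((100_000, 100_000))  # no dense 100k x 100k allocation happens here
m[0, 0] = 1.0
m[42, 90_000] = 3.5

m_csr = m.tocsr()   # convert once construction is done
print(m_csr.nnz)    # 2 nonzero entries stored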

+24
Jun 28 '09 at 2:23

Stefano Borini's post got me to look into how far along this sort of thing already is.

This is it. It seems to do basically what you want. HDF5 lets you store very large datasets, and then access and use them in the same ways NumPy does.

+11
Jun 28 '09 at 2:54

Make sure you are using a 64-bit operating system and a 64-bit version of Python/NumPy. Note that on 32-bit architectures you can typically address only about 3 GB of memory (with roughly 1 GB lost to memory-mapped I/O and the like).

With 64-bit and arrays larger than the available RAM, you can get away with virtual memory, although things will get slower if you have to swap. Also, memory maps (see numpy.memmap) are a way to work with huge files on disk without loading them into memory, but again, you need a 64-bit address space to get much benefit from them. PyTables will do most of this for you as well.

+5
Aug 19 '09 at 0:27

It's a bit alpha, but http://blaze.pydata.org/ seems to be working on a solution to this.

+5
Feb 05 '13 at 0:58

Are you asking how to handle a matrix with 2,500,000,000 elements without terabytes of RAM?

The way to handle 2 billion elements without 8 billion bytes of RAM is to not keep the matrix in memory.

That means much more sophisticated algorithms that pull pieces of it from the file system in chunks.
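
As a rough sketch of that chunked style of processing (the file name, shape, and block size below are assumptions), here is one way to compute per-column sums of a disk-backed matrix while keeping only one block in RAM at a time:

import numpy as np

# Assumes a matrix of this shape was previously written to disk as raw float32.
shape = (1_000_000, 1_000)
m = np.memmap("big_matrix_blocks.dat", dtype=np.float32, mode="r", shape=shape)

block_rows = 10_000
col_sums = np.zeros(shape[1], dtype=np.float64)
for start in range(0, shape[0], block_rows):
    block = np.asarray(m[start:start + block_rows])  # load one block from disk
    col_sums += block.sum(axis=0)

print(col_sums[:5])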

+4
Jun 28 '09 at 2:32

Sometimes one simple solution is to use a custom type for your matrix elements. Based on the range of numbers you need, you can pick a smaller dtype by hand. Since NumPy picks the largest type by default, this can be a useful idea in many cases. Here is an example:

In [70]: a = np.arange(5)

In [71]: a[0].dtype
Out[71]: dtype('int64')

In [72]: a.nbytes
Out[72]: 40

In [73]: a = np.arange(0, 2, 0.5)

In [74]: a[0].dtype
Out[74]: dtype('float64')

In [75]: a.nbytes
Out[75]: 32

And with a custom type:

In [80]: a = np.arange(5, dtype=np.int8)

In [81]: a.nbytes
Out[81]: 5

In [76]: a = np.arange(0, 2, 0.5, dtype=np.float16)

In [78]: a.nbytes
Out[78]: 8
+3
03 Oct '16 at 22:09

Usually, when we deal with large matrices, we implement them as sparse matrices.

I don't know if numpy supports sparse matrices, but I found this instead.

+1
Jun 28 '09 at 0:45

As far as I know about numpy, no, but I could be wrong.

I can offer this alternative solution: write the matrix to disk and access it in chunks. I suggest the HDF5 file format. If you need this to be transparent, you can reimplement the ndarray interface to page your in-memory matrix to and from disk. Be careful to sync the data back to disk if you modify it.
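
As an illustration of that idea, here is a minimal sketch using h5py (which the answer does not name; the file name, shapes, and chunk size are assumptions):

import numpy as np
import h5py

with h5py.File("paged_matrix.h5", "w") as f:
    # Chunked HDF5 datasets only allocate space for chunks that are actually written.
    dset = f.create_dataset("m", shape=(1_000_000, 1_000_000),
                            dtype="float32", chunks=(1024, 1024))
    dset[0:1024, 0:1024] = np.random.rand(1024, 1024)

with h5py.File("paged_matrix.h5", "r") as f:
    block = f["m"][0:10, 0:10]   # reads only the chunks covering this slice
    print(block.shape)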

+1
Jun 28 '09 at 0:46

You can run your code on Google Colab. Google Colab is a free cloud service, and it now supports a free GPU! I was able to build an 870,199 x 14,425 matrix in Google Colab that I could not create on my PC.

0
Jan 31 '19 at 21:20


