Python big matrix multiplication - what is the best option?

I have two boolean sparse square matrices, 80,000 x 80,000, created from 12 MB of data (and most likely I will have matrices that are orders of magnitude larger once I use GBs of data).

I want to multiply them (which would yield a triangular matrix - however, I do not get one, since I do not restrict the dot product to produce a triangular matrix).

I am wondering what the best way to multiply them is (in terms of memory and speed) - I am going to perform the calculation on an m2.4xlarge AWS instance that has > 60 GB of RAM. I would prefer to keep the calculation in RAM for speed.

I know that SciPy has sparse matrices, and so does h5py, but I have no experience with either.

What is the best option?

Thanks in advance

UPDATE: the density of the sparse boolean matrices is < 0.6%

python numpy sparse-matrix pytables h5py
2 answers

If your matrices are relatively empty, it may be worthwhile to encode them as a data structure holding only the non-False values. Say, a list of tuples describing the locations of the non-False values, or a dictionary with those tuples as keys.

If you use, for example, a list of tuples, you can use a list comprehension to find the items in the second list that can be multiplied by an item from the first list:

    a = [(0, 0), (3, 7), (5, 2)]  # et cetera
    b = ...                       # idem

    for r, c in a:
        # entries of b whose row index matches this entry's column index
        res = [(r, k) for j, k in b if c == j]

-- PLEASE SEE THE COMMENTS BELOW / TO THE DOWNVOTER --

You ask how to multiply matrices quickly and easily.

SOLUTION 1: This is a solved problem: use numpy. All of these operations are easy in numpy, and since they are implemented in C, they are quite fast.

Also see:

SciPy and Numpy have sparse matrices and sparse matrix multiplication. They do not use a lot of memory, because (at least if I had written it in C) they probably use linked lists and therefore only use the memory required for the actual data, plus some overhead. And it will almost certainly be incredibly fast compared to a pure Python solution.
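A minimal sketch of Solution 1 with scipy.sparse - note the matrix size and density here are scaled down from the question's 80,000 x 80,000 at 0.6% purely for illustration:

```python
import numpy as np
from scipy import sparse

# Two random boolean sparse matrices in CSR format
# (size and density reduced from the question's case for illustration).
n = 1000
a = sparse.random(n, n, density=0.006, format="csr", random_state=0).astype(bool)
b = sparse.random(n, n, density=0.006, format="csr", random_state=1).astype(bool)

# Sparse-sparse matrix multiplication; the result is also a sparse matrix,
# so memory use stays proportional to the number of nonzero entries.
c = a @ b

print(c.shape, c.nnz)
```

For the 80,000 x 80,000 case, the same code applies unchanged; at 0.6% density the CSR representation should fit comfortably in 60 GB of RAM.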

SOLUTION 2

Another answer here suggests storing the values as (x, y) tuples, with the assumption that a value is False if it is absent and True if it is present. An alternative to this is a list of (x, y, value) triples.

DOWNSIDE: Multiplying these will be nasty time-wise: take one element, determine which other elements it must be multiplied with, then search the entire data set for each specific tuple, and if it exists, multiply and insert the result into the result matrix.
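A small sketch of why that is nasty - with (row, col, value) triples in a flat list, every lookup is a linear scan of the whole data set (the matrices here are tiny, made-up examples):

```python
# Solution 2's downside: (row, col, value) triples in a flat list
# force a full O(len(b)) scan of b for every element of a.

a = [(0, 1, True), (2, 0, True)]
b = [(1, 3, True), (0, 2, True)]

result = {}
for (i, j, v) in a:
    # Find every element of b whose row matches this element's column:
    # a complete pass over b, repeated for each element of a.
    for (j2, k, w) in b:
        if j2 == j:
            result[(i, k)] = v and w

print(result)  # {(0, 3): True, (2, 2): True}
```

That inner scan makes the whole multiplication roughly O(nnz(a) * nnz(b)), which is what Solution 3 avoids.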

SOLUTION 3 (PREFERRED over Solution 2, IMHO)

I would prefer this because it is simpler / faster.

Represent each sparse matrix as a dictionary. Matrix one is a dict keyed by element position (x, y) with value v (with x1, y1, x2, y2, etc.):

    matrixDictOne = { 'x1:y1': v1, 'x2:y2': v2, ... }
    matrixDictTwo = { 'x1:y1': v1, 'x2:y2': v2, ... }

Since Python dict lookups are O(1) (well, actually, probably closer to log(n)), this is fast. It does not require searching through all of the second matrix's data for the presence of an element before each multiplication. So it is fast. It is easy to write iteratively and easy to understand.
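A sketch of Solution 3. For simplicity I key the dicts with (row, col) tuples instead of the 'x:y' strings above (which avoids string parsing), and I index the second matrix by row once up front so its entries can be found by dict lookup rather than a scan:

```python
from collections import defaultdict

def sparse_bool_matmul(a, b):
    """Multiply sparse boolean matrices stored as {(row, col): True} dicts.

    Only True entries are stored; any absent key is implicitly False.
    """
    # Group b's column indices by row, so the entries matching a given
    # column of a are found with one dict lookup instead of a scan of b.
    b_by_row = defaultdict(list)
    for (j, k) in b:
        b_by_row[j].append(k)

    result = {}
    for (i, j) in a:
        for k in b_by_row.get(j, ()):
            result[(i, k)] = True  # boolean product: True AND True
    return result

A = {(0, 1): True, (2, 0): True}
B = {(1, 3): True, (0, 2): True}
print(sparse_bool_matmul(A, B))  # {(0, 3): True, (2, 2): True}
```

The work done is proportional to the number of matching (column, row) pairs, not to the product of the two data set sizes as in Solution 2.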

SOLUTION 4 (if you are a glutton for punishment)

Build this solution using a memory-mapped file of the required size. Initialize the file with zero values. Compute the offsets yourself and write to the appropriate places in the file as you do the multiplication. Linux has a VMM that will page the data in and out for you with little overhead or work on your part. This solution is for very, very large matrices that are NOT sparse and therefore will not fit into memory.
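One way to sketch this in Python is numpy.memmap, which handles the offset arithmetic that the answer describes doing by hand; the sizes and the block width here are made-up illustrative numbers:

```python
import os
import tempfile
import numpy as np

# A dense result matrix backed by a memory-mapped file, so the OS
# pages it in and out instead of holding it all in RAM.
n = 100  # illustrative; the question's case is 80,000 x 80,000
path = os.path.join(tempfile.mkdtemp(), "result.dat")

# mode="w+" creates a zero-initialized file of the required size.
result = np.memmap(path, dtype=np.float64, mode="w+", shape=(n, n))

a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Write the product into the mapped file one block of rows at a time;
# the VMM decides what stays resident.
block = 25
for i in range(0, n, block):
    result[i:i + block, :] = a[i:i + block, :] @ b
result.flush()
```

In a real 80,000 x 80,000 run, a and b would themselves need to be memmapped or loaded blockwise as well, since a dense float64 matrix of that size is about 51 GB each.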

Note that this addresses the complaint below that the data will not fit in memory. However, the OP said sparse, which implies very little actual data spread across giant arrays, and Numpy/SciPy handle this natively and thus beautifully (many people at Fermilab use Numpy/SciPy regularly; I am confident the sparse matrix code is well tested).

