Python NumPy huge matrix multiplication

I need to multiply two large matrices and sort the columns of the result.

 import numpy
 a = numpy.random.rand(1000000, 100)
 b = numpy.random.rand(300000, 100)
 c = numpy.dot(b, a.T)  # shape (300000, 1000000)
 # indices of the 10 smallest entries in each column of c
 top10 = [numpy.argsort(j)[:10] for j in c.T]

This process takes a lot of time and memory. Is there a way to break it into chunks? If not, how can I calculate the RAM needed to complete this operation? I currently have an EC2 instance with 4 GB of RAM and no swap.

I was wondering whether this operation can be serialized so that I do not need to hold everything in memory.

+4
3 answers

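One thing you can do to speed things up is to build numpy against an optimized BLAS library such as ATLAS, GOTO BLAS, or Intel's MKL.

To work out the memory needed, monitor Python's Resident Set Size ("RSS"). The commands below were run on a UNIX system (FreeBSD, to be precise, on a 64-bit machine).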

> ipython

In [1]: import numpy as np

In [2]: a = np.random.rand(1000, 1000)

In [3]: a.dtype
Out[3]: dtype('float64')

In [4]: del(a)

To get the RSS, I ran:

ps -xao comm,rss | grep python

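[Edit: See the ps manual page for a full explanation of the options. In short, these ps options print only the command name and resident set size of every process. The equivalent format for Linux ps may be ps -xao c,r.]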

The results:

  • After starting the interpreter: 24880 kiB
  • After importing numpy: 34364 kiB
  • After creating a: 42200 kiB
  • After deleting a: 34368 kiB

Calculating the size:

In [4]: (42200 - 34364) * 1024
Out[4]: 8024064

In [5]: 8024064/(1000*1000)
Out[5]: 8.024064

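As you can see, the calculated size matches the 8 bytes per element of the default float64 dtype quite well. The difference is internal overhead.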
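As a cross-check that does not need ps, NumPy reports the size of an array's data buffer directly. A minimal sketch using the standard `itemsize` and `nbytes` attributes; it counts only the data buffer, not the interpreter's own overhead, which is why the RSS-based figure comes out slightly larger:

```python
import numpy as np

a = np.random.rand(1000, 1000)

# Bytes per element for the default float64 dtype.
print(a.itemsize)   # 8
# Size of the data buffer alone, excluding allocator overhead.
print(a.nbytes)     # 8000000
```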

The size of your original arrays in MiB will be approximately:

In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125

In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375

That is not too bad. The dot product, however, will be far too large:

In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484

That is 2235 GiB!
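The same arithmetic can be wrapped in a small helper that estimates memory from the shapes alone, without allocating anything. This is only a sketch; `array_gib` is a hypothetical name, not a NumPy function:

```python
import numpy as np

def array_gib(shape, dtype=np.float64):
    """Return the size in GiB of an array with this shape and dtype."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * np.dtype(dtype).itemsize / 1024**3

a_shape = (1000000, 100)
b_shape = (300000, 100)
c_shape = (b_shape[0], a_shape[0])   # shape of numpy.dot(b, a.T)

for name, shape in [("a", a_shape), ("b", b_shape), ("c", c_shape)]:
    print(name, round(array_gib(shape), 3), "GiB")
```

Passing `dtype=np.float32` halves every figure, which still leaves the full product far beyond 4 GB of RAM.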

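What you can do is split up the problem and perform the dot operation in pieces:

  • load b as an ndarray
  • load each row of a as an ndarray in turn
  • multiply the row by b and write the result to a file
  • del() the row and load the next one

This will not make it faster, but it will make it use far less memory!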
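The piecewise approach could look roughly like the sketch below. It departs from the list in two ways, both assumptions on my part: it processes `a` in blocks of rows rather than one row at a time, and it keeps only the ten smallest indices per column instead of writing each full column to a file, since that is all the question asks for. The names `top10_per_column` and `block_rows` are made up for illustration, and the arrays are scaled down so the example runs quickly.

```python
import numpy as np

def top10_per_column(a, b, block_rows=100, k=10):
    """Indices of the k smallest entries in each column of dot(b, a.T),
    computed one block of columns at a time."""
    result = np.empty((a.shape[0], k), dtype=np.intp)
    for start in range(0, a.shape[0], block_rows):
        stop = min(start + block_rows, a.shape[0])
        # Only this slab of c exists in memory: shape (len(b), stop - start).
        c_block = np.dot(b, a[start:stop].T)
        # Same as argsort(j)[:k] for every column j of the slab.
        result[start:stop] = np.argsort(c_block, axis=0)[:k].T
        del c_block
    return result

rng = np.random.default_rng(0)
a = rng.random((2000, 100))   # stands in for the (1000000, 100) array
b = rng.random((500, 100))    # stands in for the (300000, 100) array
top10 = top10_per_column(a, b)
print(top10.shape)   # (2000, 10)
```

At the sizes in the question, a slab of 100 columns takes 300000 * 100 * 8 bytes, about 229 MiB, and `argsort` allocates an index array of the same size, so `block_rows` has to be chosen with the 4 GB limit in mind.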

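Edit: In this case I would recommend writing the output file in binary format (for example, with struct or ndarray.tofile). That makes it much easier to read a column back from the file, for example with a numpy.memmap.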
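A minimal sketch of that idea, with small arrays and a temporary file standing in for the real output. Each computed column of `c` is appended with `ndarray.tofile`, so the file holds `c.T` in row-major order and column `j` of `c` is row `j` of the memmap. The file name and the sizes are placeholders.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((200, 100))
b = rng.random((50, 100))

path = os.path.join(tempfile.mkdtemp(), "c_columns.bin")  # placeholder name

# Write one column of c = dot(b, a.T) per row of a, as raw float64.
with open(path, "wb") as f:
    for row in a:
        np.dot(b, row).tofile(f)

# Map the file without loading it; row j here is column j of c.
c_T = np.memmap(path, dtype=np.float64, mode="r", shape=(len(a), len(b)))
column_7 = c_T[7]
print(np.argsort(column_7)[:10])
```

A raw binary file carries no shape or dtype information, so both must be supplied again when mapping it.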

+7

DrV ; . , , .

. O(100 * 300000 * 1000000) O(k) k ( , ). , DrV , , .

, , ( - ). ( ) - ( ), .

, , - . , , "", .

+3

. Roland Smith , . , , ( , , ) .

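The vectors have 100 elements. One matrix has 300 000 of them and the other 1 000 000, so the product has 300 000 000 000 elements. That is 1.2 TB or 2.4 TB, depending on whether you use 32-bit or 64-bit floats.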

(300,100) (100,1000) 1 . , 1000 ( ).

, , , . - .


, :

  • np.memmap
  • ( Roland Smith)

(2,4 ) .

. , ; , . , , , , , .

memmapped . , . 4 KiB (512 1024 ), , .


, , , . SSD- , IO ( ). , - S3 . " " , . , .


. , , , . , , . , .

, , , .

One last question: is your data sparse? (Sparse means that most elements are 0.)

+1
