Creating a large random boolean matrix with numpy

I am trying to create a huge boolean matrix that is randomly filled with True and False with a given probability p. First I used this code:

```python
N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
```

But, unfortunately, this does not finish for such a big N. Therefore, I tried to split it up and generate the matrix row by row:

```python
N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range(N):
    mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
    if i % 100 == 0:
        print(i)
```

Now something strange happens (at least on my machine): the first ~1100 rows are generated very quickly, but after that the code becomes terribly slow. Why is this happening? What am I missing here? Are there better ways to create a large matrix that has True entries with probability p and False entries with probability 1-p?

Edit: many of you suspected problems with RAM. Since the machine the code will run on has almost 500 GB of RAM, this will not be an issue.

Tags: python, numpy, random
3 answers

The problem is your RAM: the values are stored in memory as the matrix is created. I just created this matrix using the following command:

np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])

I used an AWS i3 instance with 64 GB of RAM and 8 cores. While creating this matrix, htop shows that it takes ~20 GB of RAM. Here is a benchmark:

```python
%time np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
# CPU times: user 18.3 s, sys: 3.4 s, total: 21.7 s
# Wall time: 21.7 s

def mask_method(N, p):
    for i in range(N):
        mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
        if i % 100 == 0:
            print(i)

%time mask_method(N, p)
# CPU times: user 20.9 s, sys: 1.55 s, total: 22.5 s
# Wall time: 22.5 s
```

Note that the mask method only takes about 9 GB of RAM at its peak.

Edit: the first method frees the RAM once the computation completes, whereas the function method holds on to all of it.
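A quick back-of-the-envelope check (my sketch, not part of the measurements above) makes these numbers plausible: the final boolean matrix needs one byte per cell, while the float64 intermediate that np.random.choice draws needs eight.

```python
import numpy as np

N = 30000
# 1 byte per cell for the final boolean matrix
bool_bytes = N * N * np.dtype(bool).itemsize
# 8 bytes per cell for the float64 intermediate np.random.choice draws
f64_bytes = N * N * np.dtype(np.float64).itemsize

print(bool_bytes / 1e9)  # 0.9  -> ~0.9 GB for the boolean result
print(f64_bytes / 1e9)   # 7.2  -> ~7.2 GB for the float64 intermediate
```

These two allocations alone do not account for the full ~20 GB that htop reports, but they show where most of the memory pressure comes from.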


> Therefore, I tried to split it up and generate the matrix row by row

The way np.random.choice works is to first generate a float64 in [0, 1) for every cell of your data, and then convert it into an index into your array using np.searchsorted. This intermediate representation is 8 times larger than the boolean array!
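This mechanism can be sketched directly (my rough reconstruction of the idea; the actual internals of np.random.choice may differ in detail):

```python
import numpy as np

p = 0.1
a = np.array([False, True])

cum = np.cumsum([p, 1 - p])     # cumulative probabilities: [0.1, 1.0]
u = np.random.rand(6)           # one float64 draw per cell (8 bytes each)
idx = np.searchsorted(cum, u)   # map each draw to an index into `a`
result = a[idx]                 # the final boolean values

print(result.dtype)  # bool
```

The float64 array `u` (and the integer index array) exist alongside the boolean result, which is what blows up the memory footprint.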

Since your data is boolean, you can gain a factor of 2 with

```python
np.random.rand(N, N) > p
```

which, of course, you could also use inside your row-by-row loop.
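A quick sanity check of this approach (with a smaller N than the question's, just to keep it fast):

```python
import numpy as np

N = 3000   # scaled down from the question's 30000 for a quick check
p = 0.1

# A single float64 intermediate from rand(), immediately reduced to booleans.
mask = np.random.rand(N, N) > p

print(mask.dtype)   # bool
# mask.mean() is ~0.9: True with probability 1 - p,
# matching the question's p=[p, 1-p] ordering.
print(abs(mask.mean() - (1 - p)) < 0.01)  # True
```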

It seems that np.random.choice could benefit from some buffering here; you might want to file an issue with numpy.

Another option is to try to generate float32s instead of float64s. I'm not sure numpy can do this right now, but you could request that feature.
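For what it's worth, newer NumPy releases (the Generator API added in NumPy 1.17, after this answer was written) do support single-precision uniform draws, so the intermediate can be halved:

```python
import numpy as np

N = 3000
p = 0.1

rng = np.random.default_rng()
# Generator.random accepts a dtype argument, so the intermediate uniform
# draws cost 4 bytes per cell (float32) instead of 8 (float64).
mask = rng.random(size=(N, N), dtype=np.float32) > p

print(mask.dtype)  # bool
```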


Another possibility is to generate the matrix in batches (i.e., compute many submatrices and stack them together at the very end). But you should not keep updating a single array (mask) in a for loop as the OP does: this forces the entire array to be loaded into main memory on every index update.

Instead, for example, to get 30000x30000, keep 90,000 separate 100x100 arrays, update each of those 100x100 arrays in a for loop, and finally stitch them together into one giant array at the end. This should need no more than about 4 GB of RAM and would also be very fast.
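The batching idea can be sketched like this (using row blocks rather than 100x100 tiles, since they are simpler to stack, and a smaller N for a quick demo):

```python
import numpy as np

N = 3000            # scaled down from 30000 for a quick demo
p = 0.1
rows_per_block = 100

# Build the matrix in row blocks so no single float64 intermediate
# ever spans the whole matrix, then stack the blocks at the end.
blocks = [np.random.rand(rows_per_block, N) > p
          for _ in range(N // rows_per_block)]
mask = np.vstack(blocks)

print(mask.shape)  # (3000, 3000)
print(mask.dtype)  # bool
```

Note that np.vstack copies the blocks into a fresh array, so the boolean data briefly exists twice; with a bool dtype that is still far cheaper than one matrix-sized float64 intermediate.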

Minimal example:

```python
In [9]: a
Out[9]:
array([[0, 1],
       [2, 3]])

In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])

In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
```