Another possibility is to generate the array in blocks: compute many small submatrices and assemble them into the full array only at the very end. What you should not do is update a single huge array (`mask`) inside a for loop, as the OP does: every indexed update touches the whole array, which forces the entire thing to be resident in main memory.
Instead, for example: to build a `30000x30000` array, work with 90,000 separate `100x100` blocks (300 blocks per side). Update each `100x100` block in a for loop, and only at the end join all the blocks into one giant array. The total still fits in roughly 4 GB of RAM (for a float32 dtype; a boolean mask is far smaller) and the per-block updates are fast.
Minimal example:
```
In [9]: a
Out[9]:
array([[0, 1],
       [2, 3]])

In [10]: np.hstack([np.vstack([a]*5)]*5)
Out[10]:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]])

In [11]: np.hstack([np.vstack([a]*5)]*5).shape
Out[11]: (10, 10)
```
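For the case where each block is computed differently (not just a tiling of one array), the same idea can be sketched with `np.block`, which stitches a grid of blocks into one array. The block count, block size, and fill values below are toy placeholders, not the OP's actual computation:

```python
import numpy as np

n_blocks = 3     # 3x3 grid of blocks (toy size; ~300x300 for a 30000x30000 result)
block_size = 4   # each block is 4x4

# Compute each small block independently instead of indexing into one huge array.
# Here each block is simply filled with its own index as a placeholder "update".
blocks = [[np.full((block_size, block_size), i * n_blocks + j, dtype=np.uint8)
           for j in range(n_blocks)]
          for i in range(n_blocks)]

# Only at the very end, assemble the grid of blocks into the giant array.
mask = np.block(blocks)
print(mask.shape)  # -> (12, 12)
```

If all blocks are identical, `np.tile(a, (5, 5))` is a more concise equivalent of the `hstack`/`vstack` combination above.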