How can I generate a random sample of bin counts given a sequence of bin probabilities?

I have an integer that needs to be divided into cells according to the probability distribution. For example, if I had N=100 objects included in [0.02, 0.08, 0.16, 0.29, 0.45] , you could get [1, 10, 20, 25, 44] .

 import numpy as np # sample distribution d = np.array([x ** 2 for x in range(1,6)], dtype=float) d = d / d.sum() dcs = d.cumsum() bins = np.zeros(d.shape) N = 100 for roll in np.random.rand(N): # grab the first index that the roll satisfies i = np.where(roll < dcs)[0][0] bins[i] += 1 

In fact, N and my number of boxes are very large, so the loop is not really a viable option. Is there a way I can vectorize this operation to speed it up?

+5
source share
2 answers

You can convert your PDF to CDF by taking cumsum, use this to determine the set of cells between 0 and 1, and then use these cells to calculate the histogram of a random, uniform N-length vector:

 cdf = np.cumsum([0, 0.02, 0.08, 0.16, 0.29, 0.45]) # leftmost bin edge = 0 counts, edges = np.histogram(np.random.rand(100), bins=cdf) print(counts) # [ 4, 8, 16, 30, 42] 
+4
source

You can use np.bincount for a np.bincount operation along with np.searchsorted to perform the equivalent of a roll < dcs operation. Here's the implementation to fulfill these promises -

 bins = np.bincount(np.searchsorted(dcs,np.random.rand(N),'right')) 

Run-time test using the specified parameters -

 In [72]: %%timeit ...: for roll in np.random.rand(N): ...: # grab the first index that the roll satisfies ...: i = np.where(roll < dcs)[0][0] ...: bins[i] += 1 ...: 1000 loops, best of 3: 721 ยตs per loop In [73]: %%timeit ...: np.bincount(np.searchsorted(dcs,np.random.rand(N),'right')) ...: 100000 loops, best of 3: 13.5 ยตs per loop 
+2
source

All Articles