Python: how to make a histogram with equally populated bins

I have a dataset and want to make a histogram of it. I need the bins to be equally populated, meaning that they should contain the same number of objects; this is not the more common (numpy.histogram) problem of equally spaced bins. The bin widths will consequently depend on the data and will in general differ from one another.

I would supply the desired number of bins and the data set, and get back the bin edges.

Example:

    data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
    bins_edges = somefunc(data, nbins=3)
    print(bins_edges)
    >> [1., 1.3, 2.1, 2.12]

Thus all bins contain 2 points, but their widths (0.3, 0.8, 0.02) differ.

There are two constraints:
- if a group of data points is identical, the bin containing them may be larger;
- if there are N data points and M bins are requested, bins will contain N/M points each, plus one extra bin if N % M is not 0.

Below is some code I wrote that works well for small datasets. What if I have 10**9+ points and want to speed up the process?

    import numpy as np

    def def_equbin(in_distr, binsize=None, bin_num=None):
        distr_size = len(in_distr)

        bin_size = distr_size // bin_num      # points per regular bin
        odd_bin_size = distr_size % bin_num   # leftover points

        args = in_distr.argsort()

        hist = np.zeros((bin_num, bin_size))

        for i in range(bin_num):
            hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

        if odd_bin_size == 0:
            odd_bin = None
            bins_limits = np.arange(bin_num) * bin_size
            bins_limits = args[bins_limits]
            bins_limits = np.concatenate((in_distr[bins_limits],
                                          [in_distr[args[-1]]]))
        else:
            odd_bin = in_distr[args[bin_num * bin_size:]]
            bins_limits = np.arange(bin_num + 1) * bin_size
            bins_limits = args[bins_limits]
            bins_limits = in_distr[bins_limits]
            bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

        return (hist, odd_bin, bins_limits)
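For concreteness, here is a minimal vectorized sketch of the kind of somefunc I am after (equal_count_edges is a made-up name; it takes the first point of each sorted chunk plus the overall maximum as the edges, matching the example above):

    import numpy as np

    def equal_count_edges(data, nbins):
        """Hypothetical sketch: equal-frequency bin edges from sorted data."""
        data_sorted = np.sort(data)
        size = int(np.ceil(len(data_sorted) / nbins))  # points per bin, rounded up
        # first element of every chunk, plus the overall maximum as the last edge
        return np.concatenate((data_sorted[::size][:nbins], data_sorted[-1:]))

    data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
    print(equal_count_edges(data, nbins=3))  # [1.   1.3  2.1  2.12]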
Tags: python, binning, histogram, spacing
4 answers

Using your example case (2 points per bin, 6 data points in total):

    from scipy import stats
    bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
    >> array([1. , 1.24666667, 2.05333333, 2.12])
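To get an arbitrary number of equally populated bins, the quantile fractions can be generated instead of typed by hand; a small sketch (nbins is an assumed variable name):

    import numpy as np
    from scipy import stats

    data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
    nbins = 3

    # fractions 0, 1/3, 2/3, 1 -> edges of nbins equally populated bins
    quantiles = np.linspace(0, 1, nbins + 1)
    bin_edges = stats.mstats.mquantiles(data, quantiles)
    counts, _ = np.histogram(data, bins=bin_edges)
    print(bin_edges, counts)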

Update for skewed distributions:

I ran into the same problem as @astabada, wanting to create bins that each contain an equal number of samples. When applying the solution proposed by @aganders3, I found it did not work particularly well for skewed distributions. In the case of skewed data (for example, something with a large number of zeros), stats.mstats.mquantiles for a predefined number of quantiles does not guarantee an equal number of samples in each bin. You will get bin edges that look like this:

 [0. 0. 4. 9.] 

In this case, the first bin will be empty.

To deal with skewed cases, I created a function that calls stats.mstats.mquantiles and then dynamically reduces the number of bins if the samples are not equal within a certain tolerance (30% of the smallest sample size in the example code). If the samples are not equal between bins, the code reduces the number of equally spaced quantiles by 1 and calls stats.mstats.mquantiles again, until the sample sizes are equal or only one bin exists.

I hard-coded the tolerance in this example, but it could be turned into a keyword argument if needed.

I also prefer to pass the number of equally spaced quantiles as an argument to my function, rather than passing user-defined quantiles directly to stats.mstats.mquantiles, to reduce accidental errors (for example, something like [0., 0.25, 0.7, 1.]).

Here is the code:

    import numpy as np
    from scipy import stats

    def equibins(dat, binnum, **kwargs):
        numin = binnum
        while numin > 1:
            # quantile fractions equally spaced in [0, 1), endpoint excluded
            qtls = np.linspace(0., 1.0, num=numin, endpoint=False)
            # fall back to scipy's default plotting positions (0.4, 0.4)
            # if the alpha/beta keywords are not supplied
            ebins = stats.mstats.mquantiles(dat, qtls,
                                            alphap=kwargs.get('alpha', 0.4),
                                            betap=kwargs.get('beta', 0.4))
            allhist, _ = np.histogram(dat, bins=ebins)
            if (np.unique(ebins).shape != ebins.shape
                    or not tolerance(allhist, 0.3)) and numin > 2:
                # duplicate edges or uneven counts: retry with one fewer quantile
                numin = numin - 1
                del qtls, ebins
            else:
                numin = 0
        return ebins

    def tolerance(narray, percent):
        # accept the tolerance either as a fraction or as a percentage
        if percent > 1.0:
            per = percent / 100.
        else:
            per = percent
        lev_tol = per * narray.min()
        # all bin counts must lie within lev_tol of the first bin's count
        return np.all(narray[1:] - narray[0] < lev_tol)
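A hypothetical usage example (the skewed sample data below is made up for illustration):

    import numpy as np

    # hypothetical skewed sample: many zeros plus a positive tail
    rng = np.random.default_rng(0)
    dat = np.concatenate((np.zeros(50), rng.exponential(size=50)))

    edges = equibins(dat, 5)
    counts, _ = np.histogram(dat, bins=edges)
    print(edges, counts)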

Just sort the data and split it into fixed-size chunks by length! Obviously you can never get exactly equally filled bins if the number of samples is not exactly divisible by the number of bins.

    import math
    import numpy as np

    data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
    data_sorted = np.sort(data)
    nbins = 3
    # points per bin, rounded up so that every data point is covered
    step = math.ceil(len(data_sorted) / nbins)

    binned_data = []
    for i in range(0, len(data_sorted), step):
        binned_data.append(data_sorted[i:i + step])
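If you also need the bin edges rather than the binned values themselves, they can be recovered from the chunks; a small sketch building on the snippet above:

    # first value of each chunk, plus the overall maximum as the closing edge
    bin_edges = [chunk[0] for chunk in binned_data] + [data_sorted[-1]]
    print(bin_edges)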

I would also like to mention the existence of pandas.qcut, which performs equally-populated binning quite efficiently. In your case it would work like this:

    import numpy as np
    import pandas as pd

    data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])

    # parameter q specifies the number of bins
    qc = pd.qcut(data, q=3, precision=1)

    # bin definition
    bins = qc.categories
    print(bins)
    >> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')

    # bin corresponding to each point in data
    codes = qc.codes
    print(codes)
    >> array([0, 0, 1, 1, 2, 2], dtype=int8)
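If you want the numeric bin edges rather than the interval labels, pd.qcut can also return them via its retbins parameter:

    qc, bin_edges = pd.qcut(data, q=3, retbins=True)
    print(bin_edges)  # numeric edges as a numpy array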
