I have a dataset and want to build a histogram from it. I need the bins to be the same size, by which I mean they must contain the same number of objects, rather than the more common (numpy.histogram) case of equally spaced bins. Naturally, this comes at the expense of the bin widths, which can, and in general will, differ.
I will specify the number of desired bins and the dataset, and get the bin edges in return.
Example:

    data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
    bins_edges = somefunc(data, nbins=3)
    print(bins_edges)
    >> [1.,1.3,2.1,2.12]

Thus, all bins contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
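As a quick sanity check of the example above, the counts and widths can be confirmed with numpy.histogram by passing the edges explicitly. This is only a verification sketch; it hard-codes the edges from the example rather than computing them.

```python
import numpy as np

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = np.array([1., 1.3, 2.1, 2.12])  # edges taken from the example

# With explicit edges, every bin is half-open [a, b) except the last,
# which also includes its right edge.
counts, _ = np.histogram(data, bins=bins_edges)
widths = np.diff(bins_edges)

print(counts)  # number of points falling in each bin
print(widths)  # width of each bin
```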
There are two restrictions:
- if a group of data points is identical, the bin containing them may be larger.
- if there are N data points and M bins are requested, each bin will hold N / M points, plus one extra bin if N % M is not 0.
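The arithmetic behind the second restriction can be sketched as follows, reading it the way the code in the question implements it: M bins of N // M points each, plus one leftover bin when N % M is non-zero. The helper name `bin_layout` is hypothetical and exists only for illustration.

```python
def bin_layout(n_points, n_bins):
    """Hypothetical helper: points per full bin, leftover points, total bins."""
    points_per_bin, leftover = divmod(n_points, n_bins)
    # One extra bin collects the leftover points, if there are any.
    total_bins = n_bins + (1 if leftover else 0)
    return points_per_bin, leftover, total_bins

print(bin_layout(6, 3))  # (2, 0, 3)
print(bin_layout(7, 3))  # (2, 1, 4)
```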
Here is some code I wrote, which worked well for small datasets. What if I have 10**9+ points and want to speed up the process?

    import numpy as np

    def def_equbin(in_distr, binsize=None, bin_num=None):
        # binsize is currently unused; bin_num is the number of full bins.
        distr_size = len(in_distr)

        # Integer division: points per full bin, and leftover points.
        bin_size = distr_size // bin_num
        odd_bin_size = distr_size % bin_num

        # Indices that would sort the data.
        args = in_distr.argsort()

        # One row per bin, holding the sorted points of that bin.
        hist = np.zeros((bin_num, bin_size))

        for i in range(bin_num):
            hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

        if odd_bin_size == 0:
            odd_bin = None
            bins_limits = np.arange(bin_num) * bin_size
            bins_limits = args[bins_limits]
            bins_limits = np.concatenate((in_distr[bins_limits],
                                          [in_distr[args[-1]]]))
        else:
            # Leftover points go into one extra bin.
            odd_bin = in_distr[args[bin_num * bin_size:]]
            bins_limits = np.arange(bin_num + 1) * bin_size
            bins_limits = args[bins_limits]
            bins_limits = in_distr[bins_limits]
            bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

        return (hist, odd_bin, bins_limits)

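On the speed question, one direction that might be worth trying is to skip the full sort and ask numpy.partition only for the values at the bin boundaries. The sketch below is an untested idea as far as performance goes (nothing here was benchmarked at 10**9 points), and it makes two assumptions that differ from the code above: the helper name `equal_count_edges` is hypothetical, and any leftover points are folded into the last bin instead of getting a bin of their own. It also assumes 1 <= nbins < len(data).

```python
import numpy as np

def equal_count_edges(data, nbins):
    """Hypothetical helper: edges of nbins bins with (nearly) equal counts."""
    data = np.asarray(data)
    n = data.size
    if not 1 <= nbins < n:
        raise ValueError("this sketch assumes 1 <= nbins < len(data)")
    bin_size = n // nbins
    # Positions, in sorted order, where each bin starts, plus the maximum.
    kth = np.append(np.arange(nbins) * bin_size, n - 1)
    # After partitioning, each position in kth holds the value it would
    # hold in a fully sorted array; the rest is only partially ordered.
    part = np.partition(data, kth)
    return part[kth]

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
print(equal_count_edges(data, nbins=3))
```

For the six-point example this returns the same edges as the example in the question; only the edges are computed, not the per-bin contents.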
python binning histogram spacing
astabada