Probability density function from a histogram in Python, to fit another histogram

I have a question about fitting and generating random numbers.

The situation is as follows:

First, I have a histogram of data points. I would like to interpret this histogram as a probability density function (for example, one with 2 free parameters) so that I can use it to generate random numbers. I would also like to use this function to fit another histogram.

1 answer

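You can use the cumulative distribution function (CDF) to generate random numbers from an arbitrary distribution. This technique is known as inverse transform sampling: draw a uniform random number between 0 and 1 and map it through the inverse of the CDF.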
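To make the idea concrete before moving to histograms, here is a minimal sketch for a distribution whose CDF can be inverted by hand. The exponential distribution and the rate value are my own illustrative choices, not something from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed used only to make the run repeatable

rate = 2.0  # hypothetical rate parameter, chosen for illustration

# Exponential CDF:  F(x) = 1 - exp(-rate * x)
# Inverse CDF:      F^-1(u) = -log(1 - u) / rate
u = rng.uniform(0.0, 1.0, size=100_000)   # uniform numbers in [0, 1)
samples = -np.log1p(-u) / rate            # log1p(-u) == log(1 - u)

# The sample mean should be close to the theoretical mean 1 / rate.
print(samples.mean(), 1.0 / rate)
```

With a histogram there is no closed-form inverse, so the inverse CDF has to be built numerically, which is where interpolation comes in.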

Using a histogram to build a smooth cumulative distribution function is not entirely trivial. You can interpolate, for example with scipy.interpolate.interp1d(), to get values between the centers of your bins, and this works fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the shape of the tails of the probability function, that is, for values smaller than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based, for example, on fitting a Gaussian to your histogram), use any other tail shape that suits your problem, or simply truncate the distribution.

Example:

    import numpy
    import scipy.interpolate
    import random
    import matplotlib.pyplot as pyplot

    # create some normally distributed values and make a histogram
    a = numpy.random.normal(size=10000)
    counts, bins = numpy.histogram(a, bins=100, density=True)
    cum_counts = numpy.cumsum(counts)
    bin_widths = (bins[1:] - bins[:-1])

    # generate more values with same distribution
    x = cum_counts*bin_widths
    y = bins[1:]
    inverse_density_function = scipy.interpolate.interp1d(x, y)
    b = numpy.zeros(10000)
    for i in range(len( b )):
        u = random.uniform( x[0], x[-1] )
        b[i] = inverse_density_function( u )

    # plot both
    pyplot.hist(a, 100)
    pyplot.hist(b, 100)
    pyplot.show()

This does not handle the tails, and it could handle the bin edges better, but it should get you started on using a histogram to generate more values with the same distribution.
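One possible way to treat the bin edges more carefully is sketched below. It is a variation of my own rather than the only approach: it anchors the CDF at 0 on the leftmost edge, picks a bin according to its probability mass, and then places the value linearly inside that bin. The distribution is still truncated to the histogram range, so the tails are not modelled. The names prob, cdf, idx and frac are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed used only to make the run repeatable

# Stand-in data and histogram (same setup as the example above).
a = rng.normal(size=10_000)
counts, bins = np.histogram(a, bins=100, density=True)

# Probability mass per bin, then the CDF evaluated at every bin edge:
# exactly 0 at the leftmost edge and exactly 1 at the rightmost edge.
prob = counts * np.diff(bins)
cdf = np.concatenate(([0.0], np.cumsum(prob)))
cdf /= cdf[-1]  # remove floating-point drift so the last value is 1.0

# Draw uniform numbers in [0, 1) and find the bin each one falls into.
u = rng.uniform(size=10_000)
idx = np.searchsorted(cdf, u, side="right") - 1  # cdf[idx] <= u < cdf[idx + 1]

# Interpolate linearly inside the chosen bin. Empty bins are never
# selected, so the denominator is always positive.
frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])
b = bins[idx] + frac * (bins[idx + 1] - bins[idx])

print(a.mean(), b.mean())
print(a.std(), b.std())
```

Because everything is vectorized, this also avoids the Python loop, which matters once you need many samples.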

P.S. You could also try to fit a particular known distribution described by a few parameters (which, I think, is what you mentioned in the question), but the non-parametric approach above is more general-purpose.
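If you do want the parametric route, a rough sketch might look like the following. It assumes a Gaussian shape with two free parameters (mu and sigma), which is my guess at what "2 free parameters" means in the question, and it fits the bin centers with scipy.optimize.curve_fit. The helper gaussian_pdf and the test data are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)  # seed used only to make the run repeatable

# Stand-in data: in practice counts and bins would be your own histogram.
a = rng.normal(loc=1.0, scale=2.0, size=10_000)
counts, bins = np.histogram(a, bins=100, density=True)
centers = 0.5 * (bins[1:] + bins[:-1])

def gaussian_pdf(x, mu, sigma):
    """Normal density with two free parameters (illustrative helper)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (abs(sigma) * np.sqrt(2 * np.pi))

# Starting values taken from the histogram's own mean and spread.
mu0 = np.average(centers, weights=counts)
sigma0 = np.sqrt(np.average((centers - mu0) ** 2, weights=counts))

# Least-squares fit of the density to the normalized bin heights.
(mu, sigma), _ = curve_fit(gaussian_pdf, centers, counts, p0=[mu0, sigma0])
sigma = abs(sigma)  # the sign of sigma does not affect the density

# Generate new values from the fitted distribution.
b = rng.normal(loc=mu, scale=sigma, size=10_000)

print(mu, sigma)
```

The same gaussian_pdf could then be passed to curve_fit again with the bin centers and heights of a second histogram, which is one reading of "fit another histogram". If your data is not Gaussian, swap in whatever density function fits your problem.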
