Compute cumulative distribution function (CDF) in Python

How can I compute the cumulative distribution function (CDF) in Python?

I want to calculate it from an array of points that I have (a discrete distribution), not from the continuous distributions that scipy, for example, provides.

+10
python numpy scipy statistics machine-learning
2 answers

(It is possible that my interpretation of the question is incorrect. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equally spaced. If the samples are not equally spaced, then apply np.cumsum to the array multiplied by the distances between the points.)

If you have a discrete array of samples and you want to know the CDF of the sample, you can simply sort the array. In the sorted result, the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the element in the middle of the sorted array.
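As a rough sketch of the PDF-to-CDF case mentioned above: the grid, the standard normal density, and all variable names below are assumptions chosen only for illustration, not part of the question.

```python
import numpy as np

# Assumed example: a standard normal density sampled on an equally spaced grid.
x = np.linspace(-5.0, 5.0, 1001)
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Equally spaced samples: cumulative sum times the constant spacing.
dx = x[1] - x[0]
cdf_uniform = np.cumsum(pdf) * dx

# Unequally spaced samples: weight each PDF value by the distance
# to the previous point before accumulating.
rng = np.random.default_rng(0)
x_uneven = np.sort(rng.uniform(-5.0, 5.0, 2000))
pdf_uneven = np.exp(-0.5 * x_uneven**2) / np.sqrt(2.0 * np.pi)
widths = np.diff(x_uneven, prepend=x_uneven[0])  # first width is 0
cdf_uneven = np.cumsum(pdf_uneven * widths)

# Optional normalization so that the last value is exactly 1.
cdf_uniform /= cdf_uniform[-1]
cdf_uneven /= cdf_uneven[-1]
```

Both results are simple Riemann-sum approximations, so they become more accurate as the grid gets finer.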

Let's take a closer look at this with a simple example:

    import matplotlib.pyplot as plt
    import numpy as np

    # create some randomly distributed data:
    data = np.random.randn(10000)

    # sort the data:
    data_sorted = np.sort(data)

    # calculate the proportional values of samples
    p = 1. * np.arange(len(data)) / (len(data) - 1)

    # plot the sorted data:
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    ax1.plot(p, data_sorted)
    ax1.set_xlabel('$p$')
    ax1.set_ylabel('$x$')

    ax2 = fig.add_subplot(122)
    ax2.plot(data_sorted, p)
    ax2.set_xlabel('$x$')
    ax2.set_ylabel('$p$')

    plt.show()

This gives the following graph, where the plot on the right is the traditional cumulative distribution function. It approximates the CDF of the process behind the points, but it is only an estimate as long as the number of points is finite.

[Figure: left panel, sorted data x against p; right panel, p against x (the cumulative distribution function)]

This function is easy to invert, and which form you need depends on your application.
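A minimal sketch of both directions, assuming you want to evaluate the empirical CDF at arbitrary values and also map a probability back to a value. The helper names `ecdf` and `ecdf_inverse` are hypothetical, and the "fraction of samples less than or equal to x" convention is one common choice that differs slightly from the i/(n-1) scaling used in the plot above.

```python
import numpy as np

def ecdf(sample, x):
    """Fraction of sample values that are <= x (hypothetical helper)."""
    s = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the elements that are less than or equal to x
    return np.searchsorted(s, x, side="right") / s.size

def ecdf_inverse(sample, p):
    """Value at probability p, linearly interpolated between order statistics."""
    # np.quantile's default linear interpolation places p at index p * (n - 1)
    return np.quantile(np.asarray(sample, dtype=float), p)

rng = np.random.default_rng(0)
data = rng.standard_normal(10000)

print(ecdf(data, 0.0))          # should be close to 0.5 for normal data
print(ecdf_inverse(data, 0.5))  # sample median, should be close to 0
```

Both helpers accept scalars or arrays for their second argument, so they can be used directly for plotting.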

+19

Assuming that you know how your data is distributed (i.e., you know the pdf of your data), scipy can evaluate the cdf at discrete data points.

    import numpy as np
    import scipy.stats
    import matplotlib.pyplot as plt
    import seaborn as sns

    x = np.random.randn(10000)  # generate samples from normal distribution (discrete data)
    norm_cdf = scipy.stats.norm.cdf(x)  # calculate the cdf - also discrete

    # plot the cdf
    sns.lineplot(x=x, y=norm_cdf)
    plt.show()


We can even print the first few cdf values to show that they are discrete:

    print(norm_cdf[:10])
    # output (values differ from run to run):
    # array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
    #        0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])

The same call also accepts multi-dimensional arrays. It is evaluated elementwise, so each coordinate gets its own one-dimensional cdf value rather than a joint cdf. We use the 2D data below to illustrate:

    mu = np.zeros(2)  # mean vector
    cov = np.array([[1, 0.6], [0.6, 1]])  # covariance matrix

    # generate 2d normally distributed samples using 0 mean and the covariance matrix above
    x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000)  # 1000 samples
    norm_cdf = scipy.stats.norm.cdf(x)
    print(norm_cdf.shape)
    # output: (1000, 2)
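Note that the (1000, 2) shape means one standard normal cdf value per coordinate. If the joint cdf of the correlated pair is what you need, a sketch along the following lines may help. It assumes scipy.stats.multivariate_normal is acceptable for your use case, and the evaluation points are made up for illustration.

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)                          # mean vector
cov = np.array([[1.0, 0.6], [0.6, 1.0]])  # covariance matrix

# A few hand-picked evaluation points (assumed, for illustration only).
points = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])

# Elementwise standard normal cdf: one value per coordinate, shape (3, 2).
marginal = stats.norm.cdf(points)

# Joint cdf P(X1 <= a, X2 <= b) under the given covariance, shape (3,).
joint = stats.multivariate_normal.cdf(points, mean=mu, cov=cov)

print(marginal.shape, joint.shape)
```

The joint value at a point can never exceed either of its marginal values, which is a quick sanity check on the result.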

In the examples above, I knew that my data was normally distributed, so I used scipy.stats.norm; scipy supports many other distributions as well. But then again, you need to know in advance how your data is distributed in order to use such functions. If you do not know how your data is distributed and you just pick any distribution to calculate the cdf, you will most likely get incorrect results.
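One way to check the distributional assumption before relying on a parametric cdf is to compare it against the empirical cdf of the sample. The sketch below uses scipy.stats.kstest for that comparison; the two synthetic samples are assumptions made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.standard_normal(2000)              # matches the assumption
skewed_sample = rng.exponential(scale=1.0, size=2000)  # does not match it

# Kolmogorov-Smirnov test against the standard normal cdf.
# The statistic is the largest gap between the empirical cdf and norm.cdf.
ks_normal = stats.kstest(normal_sample, "norm")
ks_skewed = stats.kstest(skewed_sample, "norm")

print("normal sample:", ks_normal.statistic, ks_normal.pvalue)
print("skewed sample:", ks_skewed.statistic, ks_skewed.pvalue)
```

A large statistic (and a tiny p-value) suggests that norm.cdf is a poor description of the data, in which case the sorted-sample approach is the safer choice.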

+1
