How can I use numpy.correlate for autocorrelation?

I need to make autocorrelation of a set of numbers, which, as I understand it, is simply a correlation of the set with itself.

I tried this using numpy's correlate function, but I don't trust the result, since it almost always gives a vector in which the first number is not the largest, as it should be.

So this question is actually two questions:

  1. What exactly does numpy.correlate do?
  2. How can I use this (or something else) to do autocorrelation?
+85
python math numpy numerical-methods
Mar 13 '09 at 17:07
13 answers

To answer your first question, numpy.correlate(a, v, mode) performs the convolution of a with the reverse of v and gives the results clipped by the specified mode. The definition of convolution, C(t) = Σ_{-∞ < i < ∞} a_i v_{t+i} where -∞ < t < ∞, allows for results from -∞ to ∞, but you obviously cannot store an infinitely long array. So it has to be clipped, and that is where the mode comes in. There are 3 different modes: full, same, and valid:

  • "full" mode returns results for each t , where both a and v have some overlap.
  • β€œsame” mode returns a result with the same length as the shortest vector ( a or v ).
  • The "real" mode returns results only when a and v completely overlap. The documentation for numpy.convolve gives more details about the modes.

For your second question, I think numpy.correlate is giving you the autocorrelation, it is just giving you a bit more as well. Autocorrelation is used to find out how similar a signal, or function, is to itself at a certain time lag. At a time lag of 0, the autocorrelation should be the highest, because the signal is identical to itself, so you expected the first element in the autocorrelation result array to be the largest. However, the correlation does not start at a time lag of 0. It starts at a negative time lag, closes in on 0, and then goes positive. That is, you were expecting:

autocorrelation(a)(t) = Σ_{-∞ < i < ∞} a_i a_{t+i}, where 0 <= t < ∞

But you had:

autocorrelation(a)(t) = Σ_{-∞ < i < ∞} a_i a_{t+i}, where -∞ < t < ∞

What you need to do is take the last half of the correlation result, and that should be the autocorrelation you are looking for. A simple python function to do this:

    def autocorr(x):
        result = numpy.correlate(x, x, mode='full')
        return result[result.size // 2:]  # // for integer division under Python 3

Of course, you will need error checking to make sure that x is actually a 1-d array. Also, this explanation is probably not the most mathematically rigorous. I have been throwing around infinities because the definition of convolution uses them, but that does not necessarily apply to autocorrelation. So the theoretical part of this explanation may be slightly shaky, but hopefully the practical results are helpful. These pages on autocorrelation are quite helpful, and can give you a much better theoretical background if you don't mind wading through the notation and heavy concepts.

+96
Mar 24 '09 at 6:09

Use the numpy.corrcoef function instead of numpy.correlate to calculate the statistical correlation for a lag of t:

    def autocorr(x, t=1):
        return numpy.corrcoef(numpy.array([x[:-t], x[t:]]))
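Note that numpy.corrcoef returns a 2x2 correlation matrix here, so the lag-t coefficient is the off-diagonal entry. A small usage sketch of that function (the example series is my own illustration):

```python
import numpy as np

def autocorr(x, t=1):
    return np.corrcoef(np.array([x[:-t], x[t:]]))

x = np.array([1., 2., 3., 4., 5.])
m = autocorr(x, t=1)   # a 2x2 correlation matrix
coeff = m[0, 1]        # the lag-1 autocorrelation coefficient
print(coeff)           # 1.0 for this perfectly linear series
```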
+17
Feb 18 '14 at 19:10

Autocorrelation comes in two versions: statistical and convolutional. They both do the same thing, except for one small detail: the statistical version is normalized to lie in the interval [-1, 1]. Here is an example of how to do the statistical one:

    def acf(x, length=20):
        return numpy.array([1] + [numpy.corrcoef(x[:-i], x[i:])[0, 1]
                                  for i in range(1, length)])
+16
Nov 02 '11 at 13:27

As I just ran into the same problem, I would like to share a few lines of code with you. In fact, there are by now several quite similar posts about autocorrelation on Stack Overflow. If you define the autocorrelation as a(x, L) = Σ_{k=0..N-L-1} (x_k - x̄)·(x_{k+L} - x̄) / Σ_{k=0..N-1} (x_k - x̄)² [this is the definition given in IDL's a_correlate function, and it agrees with what I see in answer 2 of question #12269834], then the following gives the correct results:

    import numpy as np
    import matplotlib.pyplot as plt

    # generate some data
    x = np.arange(0., 6.12, 0.01)
    y = np.sin(x)
    # y = np.random.uniform(size=300)
    yunbiased = y - np.mean(y)
    ynorm = np.sum(yunbiased**2)
    acor = np.correlate(yunbiased, yunbiased, "same") / ynorm
    # use only second half (// for integer division under Python 3)
    acor = acor[len(acor) // 2:]

    plt.plot(acor)
    plt.show()

As you can see, I checked this with a sin curve and a uniform random distribution, and both results look as I would expect them. Note that I used mode="same" instead of mode="full" , as others did.

+11
Jun 13 '13 at 14:52

Your question 1 has already been discussed in detail here in several excellent answers.

I was thinking of sharing with you a few lines of code that will allow you to calculate the autocorrelation of a signal based only on the mathematical properties of autocorrelation. That is, autocorrelation can be calculated as follows:

  • subtract the average from the signal and get an unbiased signal

  • calculate the Fourier transform of an unbiased signal

  • calculate the power spectral density of the signal by taking the squared norm of each value of the Fourier transform of the unbiased signal

  • calculate the inverse Fourier transform of the power spectral density

  • normalize the inverse Fourier transform of the power spectral density by the sum of the squares of the unbiased signal and take only half of the resulting vector

Code for this:

    def autocorrelation(x):
        """
        Compute the autocorrelation of the signal, based on the properties
        of the power spectral density of the signal.
        """
        xp = x - np.mean(x)
        f = np.fft.fft(xp)
        p = np.abs(f)**2  # squared norm of each Fourier coefficient
        pi = np.fft.ifft(p)
        return np.real(pi)[:x.size // 2] / np.sum(xp**2)
+9
Oct 20 '16 at 12:46

I think there are two things that add confusion to this topic:

  1. the statistical versus signal-processing definition: as others have already pointed out, in statistics we normalize the autocorrelation into [-1, 1].
  2. partial versus non-partial mean/variance: when the time series is shifted at a lag > 0, their overlap size will always be < the original length. Does it matter whether we use the mean and std of the original (non-partial) series, or always compute a new mean and std using the ever-changing overlap (partial)? (There is probably a formal term for this, but I am going to use "partial" for now.)

I have created 5 functions that compute the autocorrelation of a 1d array, with partial and non-partial distinctions. Some use formulas from statistics, some use correlate in the signal-processing sense, which can also be done via FFT. But all the results are autocorrelations in the statistics definition, so they illustrate how the approaches are linked to each other. Code below:

    import numpy
    import matplotlib.pyplot as plt

    def autocorr1(x, lags):
        '''numpy.corrcoef, partial'''
        corr = [1. if l == 0 else numpy.corrcoef(x[l:], x[:-l])[0][1] for l in lags]
        return numpy.array(corr)

    def autocorr2(x, lags):
        '''manually compute, non-partial'''
        mean = numpy.mean(x)
        var = numpy.var(x)
        xp = x - mean
        corr = [1. if l == 0 else numpy.sum(xp[l:] * xp[:-l]) / len(x) / var for l in lags]
        return numpy.array(corr)

    def autocorr3(x, lags):
        '''fft, pad 0s, non-partial'''
        n = len(x)
        # pad 0s to 2n-1
        ext_size = 2 * n - 1
        # nearest power of 2
        fsize = 2 ** numpy.ceil(numpy.log2(ext_size)).astype('int')
        xp = x - numpy.mean(x)
        var = numpy.var(x)
        # do fft and ifft
        cf = numpy.fft.fft(xp, fsize)
        sf = cf.conjugate() * cf
        corr = numpy.fft.ifft(sf).real
        corr = corr / var / n
        return corr[:len(lags)]

    def autocorr4(x, lags):
        '''fft, don't pad 0s, non-partial'''
        mean = x.mean()
        var = numpy.var(x)
        xp = x - mean
        cf = numpy.fft.fft(xp)
        sf = cf.conjugate() * cf
        corr = numpy.fft.ifft(sf).real / var / len(x)
        return corr[:len(lags)]

    def autocorr5(x, lags):
        '''numpy.correlate, non-partial'''
        mean = x.mean()
        var = numpy.var(x)
        xp = x - mean
        corr = numpy.correlate(xp, xp, 'full')[len(x) - 1:] / var / len(x)
        return corr[:len(lags)]

    if __name__ == '__main__':
        y = [28, 28, 26, 19, 16, 24, 26, 24, 24, 29, 29, 27, 31, 26, 38, 23, 13, 14, 28, 19, 19,
             17, 22, 2, 4, 5, 7, 8, 14, 14, 23]
        y = numpy.array(y).astype('float')

        lags = range(15)
        fig, ax = plt.subplots()

        for funcii, labelii in zip([autocorr1, autocorr2, autocorr3, autocorr4, autocorr5],
                                   ['np.corrcoef, partial',
                                    'manual, non-partial',
                                    'fft, pad 0s, non-partial',
                                    'fft, no padding, non-partial',
                                    'np.correlate, non-partial']):
            cii = funcii(y, lags)
            print(labelii)
            print(cii)
            ax.plot(lags, cii, label=labelii)

        ax.set_xlabel('lag')
        ax.set_ylabel('correlation coefficient')
        ax.legend()
        plt.show()

Here is the output figure:

[figure: the five autocorrelation curves plotted against lag]

We don't see all 5 lines, because 3 of them overlap (at the purple curve). The overlapping ones are all the non-partial autocorrelations. This is because the computations from the signal-processing methods (np.correlate, FFT) don't compute a different mean/std for each overlap.

Also note that the result from fft, no padding, non-partial (red line) is different, because it didn't pad the time series with 0s before doing the FFT, so it is a circular FFT. I can't explain in detail why, this is just what I learned from others.
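A small sketch of that difference (the array values here are my own illustration): an un-padded FFT wraps the signal around on itself, giving a circular correlation, while padding with zeros to at least 2n-1 samples recovers the linear correlation that np.correlate computes:

```python
import numpy as np

x = np.array([1., 2., 3., 4.])
xp = x - x.mean()   # [-1.5, -0.5, 0.5, 1.5]
n = len(xp)

# un-padded FFT: circular autocorrelation (lags wrap around mod n)
circ = np.fft.ifft(np.abs(np.fft.fft(xp)) ** 2).real

# zero-padded FFT: linear autocorrelation, matching np.correlate
pad = np.fft.ifft(np.abs(np.fft.fft(xp, 2 * n)) ** 2).real[:n]
lin = np.correlate(xp, xp, 'full')[n - 1:]

print(circ)  # the lag-1 term differs: the wrap-around adds xp[3]*xp[0]
print(pad)   # matches lin
print(lin)
```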

+9
Jul 04 '18 at 7:32

I use talib.CORREL for autocorrelation like this; I suspect you could do the same with other packages:

    import numpy as np
    import talib as ta

    def autocorrelate(x, period):
        # x is a deep indicator array
        # period of sample and slices of comparison

        # oldest data (period of input array) may be nan; remove it
        x = x[-np.count_nonzero(~np.isnan(x)):]
        # subtract mean to normalize indicator
        x -= np.mean(x)
        # isolate the recent sample to be autocorrelated
        sample = x[-period:]
        # create slices of indicator data
        correls = []
        for n in range((len(x) - 1), period, -1):
            alpha = period + n
            slices = (x[-alpha:])[:period]
            # compare each slice to the recent sample
            correls.append(ta.CORREL(slices, sample, period)[-1])
        # fill in zeros for sample overlap period of recent correlations
        for n in range(period, 0, -1):
            correls.append(0)
        # oldest data (autocorrelation period) will be nan; remove it
        correls = np.array(correls[-np.count_nonzero(~np.isnan(correls)):])
        return correls

    # CORRELATION OF BEST FIT (on the array returned above)
    # the highest value correlation
    max_value = np.max(correls)
    # index of the best correlation
    max_index = np.argmax(correls)
+1
Dec 22 '15 at 5:06

I think the real answer to the OP question is summarized in this excerpt from the Numpy.correlate documentation:

    mode : {'valid', 'same', 'full'}, optional
        Refer to the `convolve` docstring.  Note that the default
        is 'valid', unlike `convolve`, which uses 'full'.

This means that if you use "mode" without a definition, the Numpy.correlate function will return a scalar if it is given the same vector for two input arguments (that is, when used to perform autocorrelation).

0
Feb 01 '15 at 21:17

I am a computational biologist, and when I had to compute the auto/cross-correlations between pairs of time series of stochastic processes, I realized that np.correlate was not doing the job I needed.

Indeed, np.correlate apparently lacks the averaging over all the possible pairs of time points separated by a lag 𝜏.

This is how I defined a function that does what I need:

    def autocross(x, y):
        c = np.correlate(x, y, "same")
        # divide each lag by the number of overlapping points
        v = [c[i] / (len(x) - abs(i - (len(x) // 2))) for i in range(len(c))]
        return v

It seems to me that none of the previous answers covers this case of auto / cross-correlation: I hope this answer can be useful for someone working on random processes like me.

0
Apr 12 '18 at 15:25

A simple solution without pandas:

    import numpy as np

    def auto_corrcoef(x):
        # lag-1 autocorrelation coefficient
        return np.corrcoef(x[:-1], x[1:])[0, 1]
0
Jul 23 '18 at 13:22

Using the Fourier Transform and the Convolution Theorem

Time complexity: N*log(N)

    def autocorr1(x):
        r2 = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real
        return r2[:len(x) // 2]

Here is a normalized and unbiased version; it is also N*log(N):

    def autocorr2(x):
        r2 = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real
        # divide by the number of samples (x.size), then normalize
        c = (r2 / x.size - np.mean(x) ** 2) / np.std(x) ** 2
        return c[:len(x) // 2]

The method provided by A. Levy works, but when I tested it on my PC its time complexity seems to be N*N:

    def autocorr(x):
        result = numpy.correlate(x, x, mode='full')
        return result[result.size // 2:]
0
Sep 17 '18 at 6:20

Plot the statistical autocorrelation given a pandas datetime Series of returns:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_autocorr(returns, lags):
        autocorrelation = []
        for lag in range(lags + 1):
            corr_lag = returns.corr(returns.shift(-lag))
            autocorrelation.append(corr_lag)
        plt.plot(range(lags + 1), autocorrelation, '--o')
        plt.xticks(range(lags + 1))
        return np.array(autocorrelation)
0
Oct 08 '18 at 12:42

An alternative to numpy.correlate is available in statsmodels.tsa.stattools.acf(). This yields a continuously decreasing autocorrelation function like the one described by the OP. Implementing it is fairly simple:

    from statsmodels.tsa import stattools

    # x = 1-D array
    # Yield normalized autocorrelation function of number of lags
    autocorr = stattools.acf(x)

    # Get autocorrelation coefficient at lag = 1
    autocorr_coeff = autocorr[1]

The default behavior is to stop at 40 lags, but this can be adjusted with the nlags= parameter for your specific application. The documentation for the function also links to the statistics behind it.

0
Apr 24 '19 at 18:38


