Efficient covariance calculation with streaming return data in numpy

I am developing a Python plugin for our proprietary portfolio simulator. As the simulator steps through each day, the plugin receives that day's returns for every instrument in the portfolio.

My module needs to update its covariance matrix over the instrument cross-section every day. My approach so far has been:

  • Buffer the incoming returns inside the module in a numpy array ret of shape (num_days (D), num_instrs (N)), indexed as a ring buffer (data for date di goes to row di % ret.shape[0]).
  • Rebuild the covariance matrix by selecting rows of the buffered array with np.take (to unwrap the circular buffer) and calling np.cov pairwise (see the sketch below).
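
For reference, a minimal sketch of what the daily update currently looks like (simplified, with toy sizes; everything except ret is named for illustration only):

```python
import numpy as np

D, N = 250, 1000                  # toy sizes; production N is in the millions
ret = np.empty((D, N))            # ring buffer of daily returns
cov = np.empty((N, N))            # instrument-by-instrument covariance matrix

def on_new_day(di, day_returns):
    """Store day di's returns and rebuild the full covariance matrix."""
    ret[di % D] = day_returns

    # Unwrap the ring buffer into chronological order -- this copies every day.
    order = (np.arange(D) + di + 1) % D
    window = np.take(ret, order, axis=0)

    # Pairwise np.cov over the instrument cross-section: O(N^2 * D) work.
    for i in range(N):
        for j in range(i, N):
            c = np.cov(window[:, i], window[:, j])[0, 1]
            cov[i, j] = cov[j, i] = c
```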

This implementation has the following disadvantages:

  • Every day I have to call np.take() to unwrap the ring buffer into a contiguous ndarray, which incurs a significant copy cost (see the check below). The alternative, np.roll(), also copies the array.
  • The covariance is computed pairwise over these temporary intermediate arrays, which is O(N^2 * D) work. This does not scale well with N (on the order of millions) or D (250-1000 days).
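
To illustrate the first point, both np.take with a wrapped index and np.roll return fresh arrays rather than views into the buffer (a small check on a toy array):

```python
import numpy as np

buf = np.arange(12.0).reshape(4, 3)       # toy stand-in for the ring buffer
order = (np.arange(4) + 2) % 4            # a wrapped chronological ordering

unwrapped = np.take(buf, order, axis=0)
rolled = np.roll(buf, -2, axis=0)

# Neither result shares memory with the buffer, so each daily unwrap is a full copy.
print(np.shares_memory(buf, unwrapped))   # False
print(np.shares_memory(buf, rolled))      # False
```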

At my company we work with a rather large cross-section (N ≈ 1e6 to 1e7), so I need the implementation to be scalable, memory-efficient, and fast.

Can you suggest any improvements to the current scheme for this task?
