Time Series Aggregation Efficiency

I often need to aggregate time series with irregular timing using a given aggregation function (e.g., sum, average). However, the current solution I have seems inefficient and slow.

Take the aggregation function:

    function aggArray = aggregate(array, groupIndex, collapseFn)
        groups = unique(groupIndex, 'rows');
        aggArray = nan(size(groups, 1), size(array, 2));
        for iGr = 1:size(groups,1)
            grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
            for iSer = 1:size(array, 2)
                aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
            end
        end
    end

Note that both array and groupIndex can be two-dimensional. Each column of array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to identify the period.
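For concreteness, here is a tiny made-up example of those shape conventions (illustration only, not from the question's data):

    array      = [1 10; 2 20; 3 30; 4 40; 5 50; 6 60];  % each column is a series
    groupIndex = [1; 1; 2; 2; 2; 3];                    % each row marks the period
    aggregate(array, groupIndex, @sum)
    % ans =
    %      3    30
    %     12   120
    %      6    60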

Then, when we feed it an irregular time series (note that the second period is one base period longer than the first), the timing results are bad:

    a = rand(20006,10);
    b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
    tic; aggregate(a, b, @sum); toc

Elapsed time is 1.370001 seconds.

Using the profiler, we can find that the grIdx line takes about a quarter of the execution time (0.28 s) and the iSer loop takes about three quarters (1.17 s) of the total (1.48 s).

Compare this to the period-indifferent case:

    tic; cumsum(a); toc

Elapsed time is 0.000930 seconds.

Is there a more efficient way to aggregate this data?


Timing Results

Taking each answer and putting it in a separate function, here are the timing results I get with timeit, using MATLAB 2015b on Windows 7 with an Intel i7:

    original     | 1.32451
    felix1       | 0.35446
    felix2       | 0.16432
    divakar1     | 0.41905
    divakar2     | 0.30509
    divakar3     | 0.16738
    matthewGunn1 | 0.02678
    matthewGunn2 | 0.01977

Explanation of groupIndex

An example of a 2D groupIndex would be year number and week number for a daily data set covering the period 1980-2015:

    a2 = rand(36*52*5, 10);
    b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];

Thus, a year-week period is uniquely identified by a row of groupIndex. This is handled efficiently by calling unique(groupIndex, 'rows') and taking the third output, so feel free to ignore this part of the question.
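For illustration (my sketch), that third output maps each row of groupIndex to its group number, effectively turning the 2D case into a 1D one:

    [groups, ~, idx] = unique(b2, 'rows');  % groups: one row per distinct year-week pair
    % groups(idx(k), :) recovers the year-week pair of observation k;
    % idx itself is a 1D vector of positive integer group labels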

+8

matlab time-series

5 answers

Method # 1

You can create the mask corresponding to grIdx for all groups in one go with bsxfun(@eq,..). Then, for a collapseFn like @sum, you can bring in matrix multiplication and thus have a fully vectorized approach, like so:

    M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
    aggArray = M.'*array;

For a collapseFn like @mean, you need to do a bit more work, as shown here:

    M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
    aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array;
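As a quick sanity check (my addition, assuming a and b from the question), both formulas can be compared against the original loopy aggregate:

    groups = unique(b, 'rows');
    M = squeeze(all(bsxfun(@eq, b, permute(groups, [3 2 1])), 2));
    sumFast  = M.' * a;                               % the @sum case
    meanFast = bsxfun(@rdivide, M, sum(M, 1)).' * a;  % the @mean case
    norm(sumFast  - aggregate(a, b, @sum))   % ~0 up to floating point
    norm(meanFast - aggregate(a, b, @mean))  % ~0 up to floating point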

Method # 2

If you are working with a general collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thereby changing the complexity from O(n^2) to O(n). Some quick tests show that this gives a noticeable speedup over the original loopy code. Here's the implementation:

    n = size(groups,1);
    M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
    out = zeros(n,size(array,2));
    for iGr = 1:n
        out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
    end

Note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn is applied, hence the 1 there.


Bonus

Judging by its name, groupIndex seems to hold integer values, which could be exploited for a more efficient creation of M: treat each row of groupIndex as an indexing tuple, convert each row to a single scalar, and thus obtain a 1D version of groupIndex. This should be more efficient, since the data to be grouped is now one-dimensional. This M could be used with all the approaches listed in this post. We would create M like so:

    dims = max(groupIndex,[],1);
    agg_dims = cumprod([1 dims(end:-1:2)]);
    [~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).');
    m = size(groupIndex,1);
    M = false(m,max(idx));
    M((idx-1)*m + [1:m]') = 1;
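A sanity-check sketch (my addition): for positive integer labels such as the b2 example above, the scalar keys reproduce exactly the grouping of unique(groupIndex, 'rows'):

    groupIndex = b2;
    dims = max(groupIndex, [], 1);
    agg_dims = cumprod([1 dims(end:-1:2)]);
    [~, ~, idxKey]  = unique(groupIndex * agg_dims(end:-1:1).');
    [~, ~, idxRows] = unique(groupIndex, 'rows');
    isequal(idxKey, idxRows)   % expected: true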
+8

Mex Function 1

Speed test time: a MEX function to do the crunching. The baseline test with the original code from the question took 1.334139 seconds on my machine. For comparison, the second fastest answer, from @Divakar:

    groups2 = unique(groupIndex);
    aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;

Elapsed time is 0.589330 seconds.

Then my MEX function:

    [groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));

Elapsed time is 0.079725 seconds.

Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. The results also match the original code.

Code for generating the test data:

    array = rand(20006,10);
    groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

For pure speed, go with MEX. If the thought of compiling C++ code and the added complexity is too much pain, go with Divakar's answer. Another disclaimer: I have not subjected my function to robust testing.

Mex Approach 2

Somewhat surprisingly to me, this code appears even faster than the full MEX version in some cases (e.g., in this test it took about .05 seconds). It uses the MEX function mg_getRowsWithKey to figure out the group indices. I think this may be because my array copying in the full MEX function isn't as fast as it could be, and/or because of the overhead of calling feval. It's basically the same algorithmic complexity as the other version.

    [unique_groups, map] = mg_getRowsWithKey(groupIndex);
    results = zeros(length(unique_groups), size(array,2));
    for iGr = 1:length(unique_groups)
        array_subset = array(map{iGr},:);
        % do your collapse function on array_subset, e.g.:
        results(iGr,:) = sum(array_subset, 1);
    end

When you do array(groups(1)==groupIndex,:) to pull out the array entries associated with the first group, you search through the ENTIRE length of groupIndex. If you have millions of rows, that totally sucks. array(map{1},:) is far more efficient.
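Since the MEX source for mg_getRowsWithKey isn't shown, here is a pure-MATLAB stand-in (my sketch, not the author's code) that builds the same kind of group-to-rows map using unique and accumarray:

    function [unique_groups, map] = getRowsWithKey_matlab(groupIndex)
        [unique_groups, ~, idx] = unique(groupIndex, 'rows');
        % gather the row numbers belonging to each group into a cell array;
        % accumarray does not guarantee element order, hence the sort
        map = accumarray(idx, (1:size(groupIndex,1))', [], @(r) {sort(r)});
    end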


There is still the memory copying and other overhead associated with calling feval on the collapse function. If you implement the aggregator function efficiently in C++ so as to avoid memory copying, perhaps another 2x speedup could be achieved.

+6

A bit late to the party, but a single loop using accumarray makes a huge difference:

    function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)
        [groups, ~, index] = unique(groupIndex, 'rows');
        numCols = size(array, 2);
        aggArray = nan(size(groups, 1), numCols);
        for col = 1:numCols
            aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
        end
    end
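A usage sketch (my addition) with the sample data from the question:

    aggSum  = aggregate_gnovice(a, b, @sum);
    aggMean = aggregate_gnovice(a, b, @mean);  % any collapse fn taking a column vector works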

Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:

    original | 1.127141
    gnovice  | 0.002205

More than 500 times faster!

+4

Removing the inner loop, i.e.

    function aggArray = aggregate(array, groupIndex, collapseFn)
        groups = unique(groupIndex, 'rows');
        aggArray = nan(size(groups, 1), size(array, 2));
        for iGr = 1:size(groups,1)
            grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
            aggArray(iGr,:) = collapseFn(array(grIdx,:));
        end

and calling the collapse function with a dimension argument,

    res = aggregate(a, b, @(x)sum(x,1));

already gives some speedup (3x on my machine) and avoids errors, e.g. when sum or mean encounter a single row of data without the dimension argument and then collapse along columns rather than along labels.
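To see why the dimension argument matters, consider a group that contains a single row (my illustration):

    x = [1 2 3];   % the lone row of some group, with 3 series
    sum(x)         % returns 6: collapses across columns -- wrong shape
    sum(x, 1)      % returns [1 2 3]: collapses across rows -- what we want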

If you have just one vector of group labels, i.e. the same group labels for all data columns, you can speed things up further:

    function aggArray = aggregate(array, groupIndex, collapseFn)
        ng = max(groupIndex);
        aggArray = nan(ng, size(array, 2));
        for iGr = 1:ng
            aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
        end

The latter function gives the same results for your example, with a 6x speedup, but it cannot handle different group labels per data column.

Assuming a 2D test case for the group index (here with 10 different columns of groupIndex as well):

    a = rand(20006,10);
    B = [];
    % make random length periods for each of the 10 signals
    for i = 1:size(a,2)
        n0 = randi(10);
        b = transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
        B = [B b];
    end
    tic; erg0 = aggregate(a, B, @sum); toc           % original method
    tic; erg1 = aggregate2(a, B, @(x)sum(x,1)); toc  % just remove the inner loop
    tic; erg2 = aggregate3(a, B, @(x)sum(x,1)); toc  % use function below

Elapsed time is 2.646297 seconds.
Elapsed time is 1.214365 seconds.
Elapsed time is 0.039678 seconds (!!!!).

    function aggArray = aggregate3(array, groupIndex, collapseFn)
        [groups,ix1,jx] = unique(groupIndex, 'rows','first');
        [groups,ix2,jx] = unique(groupIndex, 'rows','last');
        ng = size(groups,1);
        aggArray = nan(ng, size(array, 2));
        for iGr = 1:ng
            aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
        end

I think this is about as fast as it gets without using MEX. Thanks to Matthew Gunn's suggestion! Profiling shows that "unique" is really cheap here, and grabbing just the first and last indices of the duplicate rows in groupIndex speeds things up considerably. I get an 88x speedup with this iteration of aggregate.
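One caveat (my note, not from the answer): aggregate3 implicitly assumes that the rows of each group are contiguous, which holds for sorted time series like these. For unsorted input, a sketch of a safe call would sort first:

    [gSorted, order] = sortrows(B);
    erg = aggregate3(a(order,:), gSorted, @(x)sum(x,1));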

+3

Well, I have a solution that is almost as fast as MEX but uses only MATLAB. The logic is the same as most of the above: create a 2D dummy matrix, but instead of using @eq I initialize the logical array from the start.

Elapsed time for mine is 0.172975 seconds.
Elapsed time for Divakar's is 0.289122 seconds.

    function aggArray = aggregate(array, group, collapseFn)
        [m,~] = size(array);
        n = max(group);
        D = false(m,n);
        row = (1:m)';
        idx = m*(group(:) - 1) + row;   % linear indices into D, as in sub2ind([m n], row, group)
        D(idx) = true;
        aggArray = zeros(n,size(array,2));
        for ii = 1:n
            aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
        end
    end
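A usage sketch (my addition) with the question's data; note the group labels must be positive integers 1..n for the direct indexing into D to work:

    aggOut = aggregate(a, b, @sum);   % b from the question holds integer labels 1..4001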
0
