Octave / Matlab: defining unique rows in a very large matrix

I have a 13146 x 13146 matrix in Octave whose unique rows I want to define. unique(M, "rows") fails due to internal and / or Octave constraints.

I looked at other messages about finding unique rows, but none of them addressed this problem with large matrices.

Now my approach would be to โ€œdivide and winโ€, for example. to

 A(:,:,i)=M(r_i:s_i,:) B(:,:,i)=unique(A(:,:,i), "rows") 

with i submatrix index, r_i and s_i starting and ending row numbers of submatrices. To return all data in one large matrix (and again define unique rows):

 C(:,:,i)=B(:,:,i)' D=reshape(C,l,n*m) E=D' F=unique(E, "rows") 

with n number of submatrices, m initial number of rows in the submatrix, and l number of columns.

Is there a better way to achieve the desired result?

+5
source share
3 answers

You can use a sliding window in the columns and use a window size that does not cause memory problems. Here is the solution

 function A = winuninque(A, winSz) nCol = size(A,2); I = zeros(size(A,1), 0); for k=1:winSz:nCol [~, ~, I] = unique([IA(:,k:min(k+winSz-1,end))], 'rows'); end [~, I] = unique(I); A = A(I, :); end 

Benchmark

To really have multiple repeating rows, itโ€™s better to create a matrix with some duplicates, otherwise it will just be sorted. Here is a comparison between the various approaches:

 >> A=repmat(rand(13146,13146), 2, 1); >> A=A(randperm(end), :); >> A=A(1:end/2,:); >> tic; B=rrunique(A); toc Elapsed time is 13.318752 seconds. >> tic; C=winunique(A, 16); toc Elapsed time is 6.606122 seconds. >> tic; D=unique(A, 'rows'); toc Elapsed time is 29.815333 seconds. >> isequal(B,C,D) ans = 1 >> size(D) ans = 9880 13146 
+1
source

Radix sort it!

It looks like you need a memory sorting algorithm. Unique rows can be found by first sorting the rows and then checking the adjacent rows for duplicates. To do this, you can adapt the sorting of the radix, sorting by each column sequentially (instead of each digit in the sequence). This will be the peak memory cost for sorting one column, not the entire matrix. Then go through the lines in the sorted result and eliminate duplicates. This operation is O(n) and requires enough memory to hold two lines.

It can also be "stable." If you track the rearranged row indexes in addition to the rearranged row values โ€‹โ€‹during sorting, you can calculate the I / O conversion indices. (This is I in the signature Matlab own [B,I] = sort(A) .) This, in turn, will allow you to rearrange the lines after deleting duplicates back to the original input order so that you can keep their order. (As an option, setOrder='stable' for Matlab unique() .) They can also be used to compute in-out display indices for the overall uniqueness operation, so you can reproduce the full multi-screen signature unique() , which can be quite useful.

Code example

Here is an example implementation of a basic example. I have not tested it completely, so do not use it in production without your own testing.

 function A = rrunique(A) %RRUNIQUE "Radix Row Unique" - find unique rows using radix sort % % # Returns the distinct rows in A. Uses the memory-efficient radix sort % # algorithm, so peak memory usage stays low(ish) for large matrices. % # This uses a modified radix sort where only the row remapping indexes are % # rearranged at each step, instead of sorting the whole input, to avoid % # having to rewrite the large input matrix. ix = 1:size(A,1); % # Final in-out mapping indexes % # Radix sort the rows for iCol = size(A,2):-1:1 c = A(ix,iCol); [~,ixStep] = sort(c); % # Don't do this! too slow % # A = A(ixStep,:); % # Just reorder the mapping indexes ix = ix(ixStep); end % # Now, reorder the big array all at once A = A(ix,:); % # Remove duplicates tfKeep = true(size(A,1),1); priorRow = A(1,:); for iRow = 2:size(A,1) thisRow = A(iRow,:); if isequal(thisRow, priorRow) tfKeep(iRow) = false; else priorRow = thisRow; end end A = A(tfKeep,:); end 

When I tested this on a matrix of your size on the Matlab R2014b on OS X, it peaked at about 3 GB of used memory, against about 1 GB to hold only the input matrix. Not bad.

 >> m = rand([13146,13146]); >> tic; rrunique(m); toc Elapsed time is 17.435783 seconds. 
+6
source

The basic idea here is the same as with Andrew answer , only the implementation in the second stage is slightly different. So, we sort the rows of the input array, duplicating each other. Then we look at the lines that look for duplicates that can be efficiently executed using diff . From the output of diff we find all zeros lines that represent these duplicate lines. We create a logical mask from the discovery and use this mask to extract valid rows from the input array. Here's the implementation and it seems to use half the memory than with unique(...'rows') -

 sM = sortrows(M); out = sM([true ; any(diff(sM,[],1)~=0,2)],:); 
0
source

All Articles