Find unique rows of an array of cells that take into account all possible permutations in each row

Question

Find unique rows of an array of cells that take into account all possible permutations in each row

I have an array of cells A dimension m * k .

I want to save A unique strings to the order of k-cells .

the "difficult" part is up to the order of k-cells " : consider cells k in the i th row A , A(i,:) ; there may be a string j A , A(j,:) , which is equivalent to A(i,:) up to the re-ordering of its cells k , which means that, for example, if k=4 , it could be like this:

 A{i,1}=A{j,2} A{i,2}=A{j,3} A{i,3}=A{j,1} A{i,4}=A{j,4}

I am currently doing the following:

 G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; h=7; M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2)); A=cell(size(M,1),2); for p=1:size(M,1) A{p,1}=squeeze(M(p,:,:)); left=~ismember(G, A{p,1}, 'rows'); A{p,2}=G(left,:); end %To find equivalent rows up to order I use a double loop (VERY slow). indices=[]; for j=1:size(A,1) if ismember(j,indices)==0 %if we have not already identified j as a duplicate for i=1:size(A,1) if i~=j if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))... &&... (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))... indices=[indices;i]; end end end end end A(indices,:)=[];

It works, but it is too slow. I hope there is something faster that I can use.

+8

performance matlab permutation cell-array

user3285148 Oct 10 '16 at 9:47

source share

3 answers

Indication of a problem: An ideal choice when identifying unique rows in an array is to use C = unique(A,'rows') . But there are two serious problems that do not allow us to use this function in this case. First, you want to count in all possible permutations of each row compared to other rows. If A has 5 columns, it means checking 120 different reorganizations per row! Sounds are impossible.

The second problem is unique ; It does not accept cells except arrays of character vector cells . Therefore, you cannot just pass A to unique and get what you expect.

Why look for an alternative? As you know, because it is currently very slow:

 With nested loop method: ------------------- Create the data (first loop): Elapsed time is 0.979059 seconds. ------------------- Make it unique (second loop): Elapsed time is 14.218691 seconds.

My decision:

Generate another array of cells containing the same cells, but converted to a string ( STR ).
Find the index of all unique elements ( id ).
Create a linked matrix with unique indices and sort the rows ( IC ).
Find unique rows .
Collect the corresponding lines A ( C ).

And this is the code:

 disp('------------------- Create the data:') tic G = [0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; ... 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; h = 7; M = reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2)); A = cell(size(M,1),2); for p = 1:size(M,1) A{p, 1} = squeeze(M(p,:,:)); left = ~ismember(G, A{p,1}, 'rows'); A{p,2} = G(left,:); end STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false); toc disp('------------------- Make it unique (vectorized):') tic [~, ~, id] = unique(STR); IC = sort(reshape(id, [], size(STR, 2)), 2); [~, col] = unique(IC, 'rows'); C = A(sort(col), :); % 'sort' makes the outputs exactly the same. toc

Performance Check:

 ------------------- Create the data: Elapsed time is 1.664119 seconds. ------------------- Make it unique (vectorized): Elapsed time is 0.017063 seconds.

Although initialization requires a bit more time and memory, this method is extremely fast in finding unique strings, given all permutations. Runtime is almost insensitive to the number of columns in A

+4

erfan Oct 13 '16 at 16:35

source share

It seems that G is a misleading point. Here is the result of nchoosek for a small number

 idx=nchoosek(1:4,2) ans = 1 2 1 3 1 4 2 3 2 4 3 4

the first line is in addition to the last line

the second line supplements the one before the last line

.....

therefore, if we extract the lines {1 , 2} from G , then its complement will be the lines {3, 4} , etc. In other words, assuming that the number of rows G is 4, then G(idx(1,:),:) is a complement to G(idx(end,:),:) .

Since the rows from G unique, all A{m,n} always have the same size.

A{p,1} and A{p,2} are complementary to each other. and the size of the unique strings A is equal to size(idx,1)/2

Therefore, there is no need for any loop or additional comparison:

 h=7; G = [0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; ... 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; idx = nchoosek(1:size(G,1),h); %concatenate complements M = [G(idx(1:size(idx,1)/2,:).',:), G(idx(end:-1:size(idx,1)/2+1,:).',:)]; %convert to cell so A1 is unique rows of A A1 = mat2cell(M,repmat(h,size(idx,1)/2,1),repmat(size(G,2),2,1));

Update : the above method works best, however, if the idea is to get A1 from A, except for GI, suggest the following method based on erfan . Instead of converting an array to a string, we can directly work with the array:

 STR=reshape([A.'{:}],numel(A{1,1}),numel(A)).'; [~, ~, id] = unique(STR,'rows'); IC = sort(reshape(id, size(A, 2),[]), 1).'; [~, col] = unique(IC, 'rows'); C1 = A(sort(col), :);

Since I use Octave, I cannot run the mex file now, then I cannot test the Dev-iL method

Result

 erfan method (string): 4.54718 seconds. rahnema1 method (array): 0.012639 seconds.

Online demo

+3

rahnema1 Oct 16 '16 at 7:12

source share

Dev-il · Accepted Answer · 2016-10-16T13:04:55+0000

I would like to propose another idea that has some conceptual similarities to erfan . In my idea are used, namely GetMD5 Representation FEX .

The main task is to “reduce” each line in A to one representative value (for example, a character vector), and then find unique entries for this vector.

Judging by the standard in comparison with other sentences, my answer does not work as well as one of the alternatives, but I think its meaning is that it is completely data type agnostic (within GetMD5 ¹ restriction), that the algorithm is very easy to understand, this is a replacement replacement because it works on A and that the resulting array is exactly equal to the one that was obtained by the original method. Of course, this requires the compiler to work and have the risk of hash collisions (which can affect the result in VERY VERY rare cases).

The following are the results of a typical run on my computer, followed by the code:

 Original method timing: 8.764601s Dev-iL method timing: 0.053672s erfan method timing: 0.481716s rahnema1 method timing: 0.009771s

 function q39955559 G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; h=7; M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2)); A=cell(size(M,1),2); for p=1:size(M,1) A{p,1}=squeeze(M(p,:,:)); left=~ismember(G, A{p,1}, 'rows'); A{p,2}=G(left,:); end %% Benchmark: tic A1 = orig_sort(A); fprintf(1,'Original method timing:\t\t%fs\n',toc); tic A2 = hash_sort(A); fprintf(1,'Dev-iL' method timing:\t\t%fs\n',toc); tic A3 = erfan_sort(A); fprintf(1,'erfan' method timing:\t\t%fs\n',toc); tic A4 = rahnema1_sort(G,h); fprintf(1,'rahnema1' method timing:\t%fs\n',toc); assert(isequal(A1,A2)) assert(isequal(A1,A3)) assert(isequal(numel(A1),numel(A4))) % This is the best test I could come up with... function out = hash_sort(A) % Hash the contents: A_hashed = cellfun(@GetMD5,A,'UniformOutput',false); % Sort hashes of each row: A_hashed_sorted = A_hashed; for ind1 = 1:size(A_hashed,1) A_hashed_sorted(ind1,:) = sort(A_hashed(ind1,:)); end A_hashed_sorted = cellstr(cell2mat(A_hashed_sorted)); % Find unique rows: [~,ia,~] = unique(A_hashed_sorted,'stable'); % Extract relevant rows of A: out = A(ia,:); function A = orig_sort(A) %To find equivalent rows up to order I use a double loop (VERY slow). indices=[]; for j=1:size(A,1) if ismember(j,indices)==0 %if we have not already identified j as a duplicate for i=1:size(A,1) if i~=j if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))... &&... (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))... indices=[indices;i]; end end end end end A(indices,:)=[]; function C = erfan_sort(A) STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false); [~, ~, id] = unique(STR); IC = sort(reshape(id, [], size(STR, 2)), 2); [~, col] = unique(IC, 'rows'); C = A(sort(col), :); % 'sort' makes the outputs exactly the same. function A1 = rahnema1_sort(G,h) idx = nchoosek(1:size(G,1),h); %concatenate complements M = [G(idx(1:size(idx,1)/2,:),:), G(idx(end:-1:size(idx,1)/2+1,:),:)]; %convert to cell so A1 is unique rows of A A1 = mat2cell(M,repmat(h,size(idx,1)/2,1),repmat(size(G,2),2,1));

¹ . If you need to create hashes of more complex data types, you can use the DataHash View FEX , which is slightly slower.

Find unique rows of an array of cells that take into account all possible permutations in each row

More articles: