How do I select rows of a matrix that have a unique entry in a specific column?

I tried to solve this in a functional way, but without much success.

Suppose you have a list of lists, and you want to select only those lists that have a unique entry in a specific position.

For example, suppose there is a matrix, and we want to select only rows that have unique elements in the first column.

Here is an example:

INPUT:

 list = {{1, 2}, {1, 3}, {4, 5}} 

I need the output:

 list = {{1, 2}, {4, 5}} 

It does not matter which of the duplicate rows gets deleted; keeping the first one is fine, but any choice works.

I tried Select, DeleteCases, DeleteDuplicates, Union, and several other things, but I cannot get it to work. I don't know how to tell Mathematica to look only at a "unique" element in one position. Union comes close, but it compares the full sublists, i.e. I do not know what to write for the test, as in

 DeleteDuplicates[list, <now what?> ] 

For reference, here is how I do it in Matlab:

 EDU>> A = [1 2; 1 3; 4 5]
 A =
      1     2
      1     3
      4     5
 EDU>> [B,I,J] = unique(A(:,1));
 EDU>> A(I,:)
 ans =
      1     3
      4     5

thanks

2 answers

Here is one way:

 DeleteDuplicates[list, First@#1 === First@#2 &] 
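Applied to the example from the question, this keeps the first row of each group that shares a first element:

 list = {{1, 2}, {1, 3}, {4, 5}};
 DeleteDuplicates[list, First@#1 === First@#2 &]
 (* {{1, 2}, {4, 5}} *)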

EDIT

Please note that the timings and discussion below are based on Mathematica 7.

After thinking a bit, I found a solution that is (at least) an order of magnitude faster for large lists, and sometimes two orders of magnitude faster for this particular case (probably a better way to put it is that the solution below has a different computational complexity):

 Clear[delDupBy];
 delDupBy[nested_List, n_Integer] :=
   Module[{parts = nested[[All, n]], ord, unpos},
     ord = Ordering[parts];
     unpos = Most@Accumulate@Prepend[Map[Length, Split@parts[[ord]]], 1];
     nested[[Sort@ord[[unpos]]]]]

Benchmarks:

 In[406]:= largeList = RandomInteger[{1, 15}, {50000, 2}];

 In[407]:= delDupBy[largeList, 1] // Timing
 Out[407]= {0.016, {{13,4},{12,1},{1,6},{6,13},{10,12},{7,15},{8,14},
   {14,4},{4,1},{11,9},{5,11},{15,4},{2,7},{3,2},{9,12}}}

 In[408]:= DeleteDuplicates[largeList, First@#1 === First@#2 &] // Timing
 Out[408]= {1.265, {{13,4},{12,1},{1,6},{6,13},{10,12},{7,15},{8,14},{14,4},
   {4,1},{11,9},{5,11},{15,4},{2,7},{3,2},{9,12}}}

This is especially noteworthy because DeleteDuplicates is a built-in function. My guess is that DeleteDuplicates with a custom test uses a pairwise comparison algorithm that is quadratic in the size of the list, while delDupBy is n*log n.

I think this is an important lesson: you should be careful when supplying custom tests to built-in functions such as Union, Sort, DeleteDuplicates, etc., since they may lose their efficiency. I discussed this in more detail in this MathGroup thread, where there are other insightful answers as well.

Finally, let me mention that essentially this same question was asked (with an emphasis on efficiency) before here. I will reproduce the solution I gave there for the case when the first (or, more generally, n-th) elements are positive integers (the generalization to arbitrary integers is straightforward):

 Clear[sparseArrayElements];
 sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

 Clear[deleteDuplicatesBy];
 Options[deleteDuplicatesBy] = {Ordered -> True, Threshold -> 1000000};
 deleteDuplicatesBy[data_List, n_Integer, opts___?OptionQ] :=
   Module[{fdata = data[[All, n]], parr, rlen = Range[Length[data], 1, -1],
      preserveOrder = Ordered /. Flatten[{opts}] /. Options[deleteDuplicatesBy],
      threshold = Threshold /. Flatten[{opts}] /. Options[deleteDuplicatesBy], dim},
     dim = Max[fdata];
     parr = If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
     parr[[fdata[[rlen]]]] = rlen;
     parr = sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
     data[[If[preserveOrder, Sort@parr, parr]]]]

The way it works is to use the first (or, more generally, n-th) elements as positions into a large table, which we pre-allocate by exploiting the fact that they are positive integers. In some cases this gives remarkable performance. Note:

 In[423]:= hugeList = RandomInteger[{1, 1000}, {500000, 2}];

 In[424]:= delDupBy[hugeList, 1] // Short // Timing
 Out[424]= {0.219, {{153,549},{887,328},{731,825},<<994>>,{986,150},{92,581},{988,147}}}

 In[430]:= deleteDuplicatesBy[hugeList, 1] // Short // Timing
 Out[430]= {0.032, {{153,549},{887,328},{731,825},<<994>>,{986,150},{92,581},{988,147}}}
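The core trick in deleteDuplicatesBy can be reduced to a minimal sketch (assuming the key column holds small positive integers): row indices are written into a table at positions given by the keys, in reverse order, so the last write for each key, and hence the surviving index, is the first occurrence:

 (* minimal sketch of the position-table idea, without the
    SparseArray machinery or options of deleteDuplicatesBy *)
 list = {{1, 2}, {1, 3}, {4, 5}};
 keys = list[[All, 1]];
 table = Table[0, {Max[keys]}];
 rlen = Range[Length[list], 1, -1];
 table[[keys[[rlen]]]] = rlen;  (* later writes overwrite earlier ones *)
 list[[Sort@Select[table, # > 0 &]]]
 (* {{1, 2}, {4, 5}} *)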

Leonid gave a long and thorough answer, as he often does. However, I believe it is worth noting that an efficient and concise solution can be implemented with:

 First /@ GatherBy[hugeList, #[[1]] &] 

where 1 is the index of the column to compare on.
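For the small example from the question, this gives the expected result while preserving the order of first appearance:

 list = {{1, 2}, {1, 3}, {4, 5}};
 First /@ GatherBy[list, #[[1]] &]
 (* {{1, 2}, {4, 5}} *)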

On my system, this is faster than delDupBy , but not as fast as deleteDuplicatesBy .

