Filter duplicate / non-duplicate rows in data.table

I have a data.table with approximately 2.5 million rows and two columns. I want to remove any rows that are duplicated in both columns. Previously, for a data.frame, I would do df <- unique(df[, c('V1', 'V2')]), but this does not work with data.table. I tried unique(df[, c(V1, V2), with = FALSE]), but it seems to still operate only on the data.table key, not the whole row.

Any suggestions?

Cheers, Davy

Example

    > dt
          V1 V2
     [1,]  A  B
     [2,]  A  C
     [3,]  A  D
     [4,]  A  B
     [5,]  B  A
     [6,]  C  D
     [7,]  C  D
     [8,]  E  F
     [9,]  G  G
    [10,]  A  B

In the data.table above, where V2 is the key of the table, only rows 4, 7, and 10 should be deleted.

    > dput(dt)
    structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
        V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G")),
        .Names = c("V1", "V2"), row.names = c(NA, -10L),
        class = c("data.table", "data.frame"),
        .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")
+65
r duplicate-removal data.table
4 answers

Until v1.9.8

From ?unique.data.table it is clear that calling unique on a data.table works only on the key columns. This means you have to set the key to all columns before calling unique.

    library(data.table)
    dt <- data.table(
      V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
      V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
    )

Call unique with one column as a key:

    setkey(dt, "V2")
    unique(dt)
         V1 V2
    [1,]  B  A
    [2,]  A  B
    [3,]  A  C
    [4,]  A  D
    [5,]  E  F
    [6,]  G  G
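Under the pre-1.9.8 behaviour described above, the alternative to keying on a single column is to key on every column, so that unique() compares whole rows. A minimal sketch with the same data (the 7-row result also matches what v1.9.8+ returns by default):

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

setkey(dt, V1, V2)  # key on all columns, not just V2
unique(dt)          # deduplicates whole rows: 7 rows remain
```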



For v1.9.8+

From ?unique.data.table: by default, all columns are used (which is consistent with ?unique.data.frame).

    unique(dt)
       V1 V2
    1:  A  B
    2:  A  C
    3:  A  D
    4:  B  A
    5:  C  D
    6:  E  F
    7:  G  G

Or use the by argument to get unique combinations of specific columns (the way the key was used previously):

    unique(dt, by = "V2")
       V1 V2
    1:  A  B
    2:  A  C
    3:  A  D
    4:  B  A
    5:  E  F
    6:  G  G
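Relatedly, duplicated.data.table accepts the same by argument (assuming data.table v1.9.8+), which is handy for inspecting which rows would be dropped before removing them. A sketch with the question's data:

```r
library(data.table)

dt <- data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
                 V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))

duplicated(dt)                 # TRUE for rows 3, 4, and 8 (repeats of whole rows)
dt[!duplicated(dt)]            # same 7 rows as unique(dt)
dt[duplicated(dt, by = "V2")]  # the rows that unique(dt, by = "V2") would drop
```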
+74

With your example data.table ...

    > dt <- data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
    +                  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
    > setkey(dt, V2)

Consider the following tests:

    > haskey(dt)                 # obviously dt has a key, since we just set it
    [1] TRUE
    > haskey(dt[, list(V1, V2)]) # ... but this is treated as a "new" table and has no key
    [1] FALSE
    > haskey(dt[, .SD])          # note that this still has a key
    [1] TRUE

So you can list the columns of the table and then call unique() on that, without having to set the key to all columns or remove it (by setting it to NULL), as required by @Andrie's solution (as edited by @MatthewDowle). The solutions suggested by @Pop and @Rahul did not work for me.
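If you want whole-row unique() on a pre-1.9.8 data.table without disturbing the original table's key at all, the key-dropping approach can be wrapped in a small helper. A sketch; the function name unique_rows is mine, not part of data.table:

```r
library(data.table)

dt <- data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
                 V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
setkey(dt, V2)

# Hypothetical helper: whole-row unique() that leaves the caller's key intact.
unique_rows <- function(x) {
  tmp <- data.table::copy(x)  # copy, so the original table is untouched
  setkey(tmp, NULL)           # drop the key on the copy
  unique(tmp)                 # unique() now compares whole rows
}

unique_rows(dt)  # 7 rows; dt itself is still keyed by V2
```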

See Try 3 below, which is very similar to your initial attempt. Your example wasn't entirely clear, so I'm not sure why it didn't work. It has also been a few months since you posted the question, so perhaps data.table has been updated since then?

    > unique(dt)                 # Try 1: wrong answer (missing V1=C, V2=D)
       V1 V2
    1:  B  A
    2:  A  B
    3:  A  C
    4:  A  D
    5:  E  F
    6:  G  G
    > dt[!duplicated(dt)]        # Try 2: wrong answer (missing V1=C, V2=D)
       V1 V2
    1:  B  A
    2:  A  B
    3:  A  C
    4:  A  D
    5:  E  F
    6:  G  G
    > unique(dt[, list(V1, V2)]) # Try 3: correct answer; does not require modifying the key
       V1 V2
    1:  B  A
    2:  A  B
    3:  A  C
    4:  A  D
    5:  C  D
    6:  E  F
    7:  G  G
    > setkey(dt, NULL)
    > unique(dt)                 # Try 4: correct answer; requires the key to be removed
       V1 V2
    1:  B  A
    2:  A  B
    3:  A  C
    4:  A  D
    5:  C  D
    6:  E  F
    7:  G  G
+6
Jan 16 '13 at 2:50

unique(dt) works on your example.

+1

This should work for you.

    dt <- unique(dt, by = c('V1', 'V2'))
0
Apr 08 '19 at 10:21


