How to extract unique rows from a subset of columns in a data table?

Question

I would like to extract the unique rows of a data.table, restricted to a subset of columns and to the rows matching a condition in i. What is the best way to do this? ("Best" in terms of computational speed and short, readable syntax.)

library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE))

Let's say I would like to find the unique combinations of c1 and c2 where c4 is equal to 10. I can come up with a couple of ways to do this, but I'm not sure which is optimal. Whether or not the columns to extract are the key may also matter.

## works but gives an extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)]
## this removes extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)][,V1 := NULL]

## this seems like it could work
## but no j-expression with a keyby throws an error
jk[c4 >= 10, , keyby = list(c1,c2)]

## using unique with .SD
jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")]
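To illustrate the point about keys, here is a minimal sketch comparing the keyby approach with the unique(.SD) approach on the same data. The names keyed and plain are introduced only for this illustration, and fsetequal() is data.table's set-equality helper, which assumes a reasonably recent data.table version.

```r
library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE))

# keyby groups by c1 and c2 and sets them as the key of the result;
# the dummy column V1 produced by j = TRUE is then dropped by reference
keyed <- jk[c4 >= 10, TRUE, keyby = list(c1, c2)][, V1 := NULL]

# unique(.SD) restricted to c1 and c2 leaves the result unkeyed
plain <- jk[c4 >= 10, unique(.SD), .SDcols = c("c1", "c2")]

key(keyed)               # key columns of the keyby result
key(plain)               # key of the unique(.SD) result
fsetequal(keyed, plain)  # do both hold the same set of rows?
```

Both results should contain the same combinations; the difference is that the keyby result comes back sorted and keyed on c1 and c2, which may or may not be what you want downstream.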

r unique data.table

Blue magister Oct 24 '13 at 18:49

1 answer

mrip · Accepted Answer · 2013-10-24T21:00:38+0000

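A straightforward way is unique(jk[c4 >= 10, list(c1, c2)]) or, as @Justin suggested, unique(jk[c4 >= 10, c("c1", "c2"), with = F]). In the benchmark below, the latter is the fastest of the options tried: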

library(microbenchmark)

microbenchmark(
  a = jk[c4 >= 10, list(c1, c2), keyby = list(c1, c2)][, c("c1", "c2"), with = FALSE],
  b = jk[c4 >= 10, unique(.SD), .SDcols = c("c1", "c2")],
  c = unique(jk[c4 >= 10, list(c1, c2)]),
  d = unique(jk[c4 >= 10, c("c1", "c2"), with = FALSE])
)

Unit: microseconds
 expr      min       lq    median        uq      max neval
    a 1378.742 1456.676 1494.9380 1531.1395 2515.796   100
    b  906.404  943.072  963.7790  997.4930 3805.846   100
    c 1167.125 1201.988 1232.3500 1272.2250 2077.047   100
    d  627.768  653.314  669.8625  683.8045  739.808   100
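As a sanity check on the comparison, the sketch below verifies that candidates b, c and d return the same set of rows, so the timings compare equivalent results. The names res_b, res_c and res_d are introduced here for illustration, and fsetequal() assumes a reasonably recent data.table version. Timings on a 60-row table are also likely to be dominated by fixed overhead, so it is worth re-running the benchmark on data of realistic size.

```r
library(data.table)

set.seed(1)
jk <- data.table(c1 = sample(letters, 60, replace = TRUE),
                 c2 = sample(c(TRUE, FALSE), 60, replace = TRUE),
                 c3 = sample(letters, 60, replace = TRUE),
                 c4 = sample.int(10, 60, replace = TRUE))

# The three unique()-based candidates from the benchmark
res_b <- jk[c4 >= 10, unique(.SD), .SDcols = c("c1", "c2")]
res_c <- unique(jk[c4 >= 10, list(c1, c2)])
res_d <- unique(jk[c4 >= 10, c("c1", "c2"), with = FALSE])

# Compare the results as sets of rows, ignoring row order
fsetequal(res_b, res_c)
fsetequal(res_c, res_d)
```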
