Data.table - select the first n rows within the group

Question


As simple as it sounds, I don't know the data.table way to select the first n rows within each group of a data table. Could you help me?

+8

r data.table

paljenczy Jan 12 '16 at 20:20


2 answers

We can use head with .SD

 library(data.table)
 dt <- data.table(mtcars)

 dt[, head(.SD, 3), by = "cyl"]

    cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
 1:   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
 2:   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
 3:   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
 4:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
 5:   4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
 6:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
 7:   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
 8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
 9:   8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3

+5

paljenczy Jan 12 '16 at 20:20


Jaap · Accepted Answer · 2016-01-12T20:33:36+0000

As an alternative:

 dt[, .SD[1:3], cyl]
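One difference between `head(.SD, 3)` and `.SD[1:3]` only shows up when a group has fewer than n rows. The sketch below uses a made-up toy table, `small` (an assumption for illustration, not part of the original example), in which one group has only two rows. Indexing with `1:3` runs past the end of that group and pads the result with an NA row, while `head()` simply stops. Capping the index at `.N` is one possible way to keep the indexing style without the padding.

```r
library(data.table)

# Hypothetical toy table: group "a" has 4 rows, group "b" has only 2
small <- data.table(g = c("a", "a", "a", "a", "b", "b"), x = 1:6)

# head() stops at the end of each group
res_head <- small[, head(.SD, 3), by = g]

# 1:3 indexes past the end of group "b", which adds an NA row for it
res_idx <- small[, .SD[1:3], by = g]

# Capping the index at the group size (.N) avoids the NA padding
res_capped <- small[, .SD[seq_len(min(.N, 3L))], by = g]

nrow(res_head)    # rows kept by head()
nrow(res_idx)     # one more row, because of the NA padding
nrow(res_capped)  # same count as head()
```

In `mtcars` every `cyl` group has more than three rows, so the two forms give the same rows there.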

In terms of speed on the example dataset, the head method is on par with the .I method of @eddi. A comparison with microbenchmark:

 library(microbenchmark)

 microbenchmark(head = dt[, head(.SD, 3), cyl],
                SD = dt[, .SD[1:3], cyl],
                I = dt[dt[, .I[1:3], cyl]$V1],
                times = 10, unit = "relative")

leads to:

 Unit: relative
  expr      min       lq     mean   median       uq       max neval cld
  head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a
    SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
     I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a

However, data.table is specifically designed for large datasets, so let's repeat the comparison on one:

 # creating a dataset of 30 million rows (1e7 sampled rows per cyl group)
 largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]

 # running the benchmark on the large dataset
 microbenchmark(head = largeDT[, head(.SD, 3), cyl],
                SD = largeDT[, .SD[1:3], cyl],
                I = largeDT[largeDT[, .I[1:3], cyl]$V1],
                times = 10, unit = "relative")

leads to:

 Unit: relative
  expr      min       lq     mean   median       uq     max neval cld
  head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
    SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
     I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a

Now the .I method is undoubtedly the fastest.
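The `.I` idiom from the benchmark can be easier to follow when split into its two steps. The sketch below reuses the `mtcars` example. The intermediate names `idx` and `first3` are introduced here for illustration only. A plausible reason for the speed difference, stated as an assumption rather than something measured here, is that the inner call only returns row numbers and never has to assemble `.SD` for each group.

```r
library(data.table)
dt <- data.table(mtcars)

# Step 1: collect the row numbers (.I) of the first 3 rows of each cyl group.
# Because the j expression is unnamed, the result column is called V1.
idx <- dt[, .I[1:3], by = cyl]$V1

# Step 2: subset the original table once with those row numbers
first3 <- dt[idx]

# Same rows as the head() approach, one block of 3 per cyl group
by_head <- dt[, head(.SD, 3), by = cyl]
```

Note that `first3` keeps the original column order of `dt`, whereas the grouped `head()` result moves `cyl` to the front. As with `.SD[1:3]`, `.I[1:3]` would produce NA indices for any group with fewer than three rows.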

Update 2016-02-12:

With the most recent version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. The benchmark now gives:

 Unit: relative
  expr      min       lq     mean   median       uq      max neval cld
  head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
    SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
     I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a

However, with a slightly smaller dataset (but still quite large), the odds change:

 largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

 Unit: relative
  expr      min       lq     mean   median       uq      max neval cld
  head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
    SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
     I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a
