R: data.table vs merge(aggregate()) performance

Or, more generally, DT[, .SD[...], by = ...] versus merge(aggregate(...)).

Without further comment, here is the data and an example:

    library(data.table)

    set.seed(5141)
    size <- 1e6
    df <- data.table(a = rnorm(size),
                     b = paste0(sample(letters, size, TRUE),
                                sample(letters, size, TRUE),
                                sample(letters, size, TRUE)),
                     c = sample(1:(size/10), size, TRUE),
                     d = sample(seq.Date(as.Date("2015-01-01"),
                                         as.Date("2015-05-31"), by = "day"),
                                size, TRUE))

    system.time(df[, .SD[d == max(d)], by = c])
    #  user  system elapsed
    # 50.89    0.00   51.00

    system.time(merge(aggregate(d ~ c, data = df, max), df))
    #  user  system elapsed
    # 18.24    0.20   18.45

data.table is usually unbeaten on performance, so I was surprised by this specific example. I had to subset (filter) a fairly large data frame, keeping only the latest (possibly simultaneous) occurrences of certain types of events, together with the rest of the relevant data for those events. However, it seems that .SD does not scale very well in this particular application.

Is there a better "data.table way" to solve this kind of problem?

1 answer

We can use .I to get the row indices and then subset the rows based on them. It should be faster.

    system.time(df[df[, .I[d == max(d)], by = c]$V1])
    #  user  system elapsed
    #  5.00    0.09    5.30
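To see what the inner call returns, here is a minimal sketch on a tiny made-up table (the name `toy` and its values are assumptions for illustration, not part of the benchmark). `.I` holds the row numbers of the original table, so the grouped call yields one index per matching row, in a column that data.table names `V1` by default.

```r
library(data.table)

# Hypothetical toy data: three groups in `c`, with a tie for the latest date in group 1
toy <- data.table(
  c = c(1L, 1L, 1L, 2L, 2L, 3L),
  d = as.Date(c("2015-01-01", "2015-03-01", "2015-03-01",
                "2015-02-01", "2015-01-15", "2015-05-31")),
  a = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
)

# Row indices (into `toy`) of the rows that hold each group's latest date
idx <- toy[, .I[d == max(d)], by = c]$V1

# Subset the original table once, using those indices
latest <- toy[idx]
print(idx)
print(latest)
```

Ties are kept: both rows of group 1 that share the latest date are returned, which matches the "possibly simultaneous" requirement in the question.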

@Heroka's solution:

    system.time(df[, is_max := d == max(d), by = c][is_max == TRUE, ])
    #  user  system elapsed
    #  5.06    0.00    5.12
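One thing to keep in mind with this variant: `:=` adds the `is_max` column to the table by reference, so the original table keeps the helper column after the call. A small sketch on made-up data (the name `toy` and its values are assumptions for illustration) shows the side effect and one way to clean up afterwards.

```r
library(data.table)

# Hypothetical toy data: two groups in `c`
toy <- data.table(
  c = c(1L, 1L, 2L),
  d = as.Date(c("2015-01-01", "2015-03-01", "2015-02-01"))
)

# Flag each group's latest rows, then keep only the flagged rows
latest <- toy[, is_max := d == max(d), by = c][is_max == TRUE]

# `:=` modified `toy` in place, so the helper column is still there
print(names(toy))

# Drop the helper column from both tables once it is no longer needed
toy[, is_max := NULL]
latest[, is_max := NULL]
print(names(toy))
```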

The aggregate method on my machine gives

    system.time(merge(aggregate(d ~ c, data = df, max), df))
    #  user  system elapsed
    # 48.62    1.00   50.76

and the .SD option gives

    system.time(df[, .SD[d == max(d)], by = c])
    #   user  system elapsed
    # 151.13    0.40  156.57

Using a data.table join:

    system.time(df[df[, list(d = max(d)), c], on = c('c', 'd')])
    #  user  system elapsed
    #  0.58    0.01    0.60
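The join can be read as two steps, sketched here on a tiny made-up table (the names `toy`, `max_per_group` and `via_join` are assumptions for illustration): first compute one row per group holding that group's latest date, then join it back to the full table on both columns so only the matching rows survive.

```r
library(data.table)

# Hypothetical toy data: three groups in `c`, with a tie for the latest date in group 1
toy <- data.table(
  c = c(1L, 1L, 1L, 2L, 2L, 3L),
  d = as.Date(c("2015-01-01", "2015-03-01", "2015-03-01",
                "2015-02-01", "2015-01-15", "2015-05-31")),
  a = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
)

# Step 1: one row per group with that group's latest date
max_per_group <- toy[, list(d = max(d)), by = c]

# Step 2: join back on both columns; every row of `toy` that matches
# a (group, latest date) pair is returned, including ties
via_join <- toy[max_per_group, on = c("c", "d")]
print(via_join)
```

No per-row logical comparison is run inside each group here; the grouped `max` and the join do the work, which is consistent with the timing above.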

The merge/aggregate approach and the == approach do different work, so the two timings are not a like-for-like comparison. Usually, the aggregate/merge method is slower than the corresponding data.table join. Here, however, the slow data.table variant uses ==, which compares every row (this takes some time), together with .SD for subsetting (which is also less efficient than .I for row indexing). .SD also carries the overhead of [.data.table.
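For anyone who wants to check these relative costs on their own machine, here is a rough sketch on a much smaller sample (the size of 2e4 rows and the names `small`, `res_*` and `t_*` are assumptions chosen to keep the run short; absolute timings will differ from those quoted above). It times the three data.table variants and confirms that they select the same rows.

```r
library(data.table)

set.seed(5141)
size <- 2e4  # assumption: far smaller than the 1e6 rows in the question
small <- data.table(
  a = rnorm(size),
  c = sample(1:(size / 10), size, TRUE),
  d = sample(seq.Date(as.Date("2015-01-01"), as.Date("2015-05-31"), by = "day"),
             size, TRUE)
)

# Time each variant and keep its result for comparison
t_sd   <- system.time(res_sd   <- small[, .SD[d == max(d)], by = c])
t_i    <- system.time(res_i    <- small[small[, .I[d == max(d)], by = c]$V1])
t_join <- system.time(res_join <- small[small[, list(d = max(d)), by = c],
                                        on = c("c", "d")])

# Number of rows selected by each variant
print(c(sd = nrow(res_sd), i = nrow(res_i), join = nrow(res_join)))

# Elapsed seconds for each variant
print(c(sd = t_sd[["elapsed"]], i = t_i[["elapsed"]], join = t_join[["elapsed"]]))
```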
