Or, to put it more generally: DT[, .SD[...], by = ...] versus merge(aggregate(...)).
Without further comment, here is the data and an example:
library(data.table)

set.seed(5141)
size <- 1e6
df <- data.table(
  a = rnorm(size),
  b = paste0(sample(letters, size, TRUE),
             sample(letters, size, TRUE),
             sample(letters, size, TRUE)),
  c = sample(1:(size/10), size, TRUE),
  d = sample(seq.Date(as.Date("2015-01-01"), as.Date("2015-05-31"), by = "day"),
             size, TRUE)
)

system.time(df[, .SD[d == max(d)], by = c])
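For comparison, the merge(aggregate(...)) counterpart mentioned above would look roughly like this (a sketch of the base-R route, using the same df: compute each group's maximum of d, then merge it back to recover the remaining columns):

# Base-R equivalent: aggregate the per-group maxima of d, then merge back
system.time({
  agg <- aggregate(d ~ c, data = df, FUN = max)
  res <- merge(df, agg, by = c("c", "d"))
})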
Being used to data.table's usually excellent performance, I was surprised by this specific example. I had to subset (aggregate) a fairly large data frame, taking only the last (possibly simultaneous) occurrences of certain event types, while keeping the rest of the relevant data for those specific events. However, it seems that .SD does not scale well in this particular application.
Is there a better "data.table way" to solve this kind of problem?
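One alternative I am aware of, though I am not sure it is the intended idiom, avoids materializing .SD per group by computing each group's maximum once and joining it back (a sketch, using the same df as above):

# Compute each group's max(d) once, then join back on (c, d)
system.time(df[df[, .(d = max(d)), by = c], on = .(c, d)])

A related trick is df[df[, .I[d == max(d)], by = c]$V1], which collects the matching row indices per group and subsets once, rather than joining.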