How to speed up subsetting of a large data set?

I have a set of financial transactions; it's quite large, but small enough to keep in memory.

    R> str(trans)
    'data.frame':   130000000 obs. of  5 variables:
     $ id    : int  5 5 5 5 6 11 11 11 11 11 ...
     $ kod   : int  2 3 2 3 38 2 3 6 7 6 ...
     $ ar    : int  329 329 330 330 7 329 329 329 329 329 ...
     $ belopp: num  1531 -229.3 324 -48.9 0 ...
     $ datum : int  36976 36976 37287 37287 37961 36976 36976 37236 37236 37281 ...

I need to go through it, retrieve the transactions for each unique identifier, and do a bunch of calculations. The problem is that subsetting the dataset is too slow:

    R> system.time(
    +   sub <- trans[trans$id == 15, ]
    + )
       user  system elapsed
       7.80    0.55    8.36
    R> system.time(
    +   sub <- subset(trans, id == 15)
    + )
       user  system elapsed
       8.49    1.05    9.53

Since this dataset has about 10 million unique identifiers, such a loop would take forever. Any ideas on how I can speed it up?
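To be concrete, the loop I have in mind looks roughly like this (myCalc is just a placeholder for the real per-id calculations):

    ids <- unique(trans$id)
    results <- vector("list", length(ids))
    for (i in seq_along(ids)) {
        sub <- trans[trans$id == ids[i], ]   # this subsetting step is the slow part
        results[[i]] <- myCalc(sub)          # placeholder for the actual calculations
    }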

EDIT: I tried data.table, with indexing and sorting, but with little luck...

    library(data.table)
    trans2 <- as.data.table(trans)
    trans2 <- trans2[order(id)]
    trans2 <- setkey(trans2, id)

    R> system.time(
    +   sub <- trans2[trans2$id == 15, ]
    + )
       user  system elapsed
       7.33    1.08    8.41
    R> system.time(
    +   sub <- subset(trans2, id == 15)
    + )
       user  system elapsed
       8.66    1.12    9.78

EDIT 2: Amazing.

    R> system.time(
    +   sub <- trans2[J(15)]
    + )
       user  system elapsed
          0       0       0
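With the key set, it also looks like the explicit loop can be skipped entirely and the per-id work done in one grouped call (sum(belopp) below is just a stand-in for my real calculations):

    res <- trans2[, list(sum_belopp = sum(belopp)), by = id]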
2 answers

Note: this answer was edited to change the computed function from rowSums to colSums (using lapply in the data.table case).

I don't think you'll get results faster than with data.table. Here is a benchmark of plyr against data.table. Of course, if the time-consuming part is your function, you can use doMC to run plyr in parallel (assuming you have many cores or are working on a cluster). Otherwise, I'd stick with data.table. Here's an analysis with huge test data and a dummy function:

    # create a huge data.frame with repeating id values
    len <- 1e5
    reps <- sample(1:20, len, replace = TRUE)
    x <- data.frame(id = rep(1:len, reps))
    x <- transform(x, v1 = rnorm(nrow(x)), v2 = rnorm(nrow(x)))

    > nrow(x)
    [1] 1048534   # ~1 million rows

    # construct functions for data.table and plyr
    # method 1: using data.table
    DATA.TABLE <- function() {
        require(data.table)
        x.dt <- data.table(x, key = "id")
        x.dt.out <- x.dt[, lapply(.SD, sum), by = id]
    }

    # method 2: using plyr
    PLYR <- function() {
        require(plyr)
        x.plyr.out <- ddply(x, .(id), colSums)
    }

    # let's benchmark
    > require(rbenchmark)
    > benchmark(DATA.TABLE(), PLYR(), order = "elapsed", replications = 1)[1:5]
              test replications elapsed relative user.self
    1 DATA.TABLE()            1   1.006    1.000     0.992
    2       PLYR()            1  67.755   67.351    67.688

On a data.frame with 1 million rows, data.table takes 0.992 seconds. The speed-up of data.table over plyr (admittedly, on computing column sums) is 68x. Depending on the computation time inside your function, this speed-up will vary, but data.table will still be much faster. plyr implements a split-apply-combine strategy; I don't think you would get a comparable speed-up by doing the split, apply, and combine steps yourself with base functions, but of course you can try.
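If you do want to try the base-R route for this particular dummy computation, a minimal sketch (column sums per id, untested on your data) would be:

    # base-R split-apply-combine: sum v1 and v2 within each id
    base_out <- aggregate(cbind(v1, v2) ~ id, data = x, FUN = sum)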

I also ran the code with 10 million rows: data.table ran in 5.893 seconds, while plyr took 6300 seconds.
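And if your per-group function really is the bottleneck, the doMC route mentioned above would look roughly like this (a sketch; it assumes a multicore machine with the doMC package installed):

    library(plyr)
    library(doMC)
    registerDoMC(cores = 4)   # assumed number of cores; adjust to your machine
    x.par.out <- ddply(x, .(id), colSums, .parallel = TRUE)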


Why not use a split-apply-combine strategy?

Something like this (without sample data I can't tell whether it works):

    # function to split the data frame into a list of row indices by id
    fastsplit <- function(df) {
        lista <- split(seq(nrow(df)), df$id)
        return(lista)
    }
    lista_split <- fastsplit(trans)

    # now, assuming that one of the calculations is, for instance, to sum belopp,
    # apply the function to each subset
    result1 <- lapply(lista_split, function(.indx) {
        sum_belopp <- sum(trans$belopp[.indx])
    })

    # combine stage
    r1 <- do.call(rbind, result1)

That said, it would probably be faster and easier if you can use SQL. Maybe the sqldf package can help you here? I have never tried it, so I don't know how fast it is, but the SQL is quite simple. To do the same as the R code above, just use something like:

    select id
         , sum(belopp) as sum_belopp
    from trans
    group by id

This will return a table with two columns: id and the sum of belopp for each id.
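If sqldf behaves as I expect, running that query from R would be something like this (untested):

    library(sqldf)
    r2 <- sqldf("select id, sum(belopp) as sum_belopp from trans group by id")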

