Applying custom function on data.table instead of using plyr and ddply

Question

I am processing a data table called orderFlow and computing potentialWelfare.tmp from it. So far my approach has been based on plyr, but because the input orderFlow has millions of rows, I would prefer a solution that takes advantage of the performance of data.table in R.

  # solution so far, poor performance on huge orderFlow input data.table
  require(plyr)
  potentialWelfare.tmp = ddply(orderFlow, .variables = c("simulationrun_id", "db"), .fun = calcPotentialWelfare, .progress = "text", .parallel=TRUE)

Edit1: In short, the custom function counts the asks and bids in df and sums the valuations of the top NbAsks bids, sorted by valuation. This selects the most valuable bids and adds up their valuations. The code is old and perhaps not very efficient, but it served its purpose in combination with plyr and regular data.frames.

  calcPotentialWelfare <- function(df){
    NbAsks = dim(df[df$type=="ask",])[1]
    # print(NbAsks)
    Bids = df[df$type == "bid",]
    # dd[with(dd, order(-z, b)), ]
    Bids = Bids[with(Bids,order(valuation,decreasing = TRUE)),]
    NbBids = dim(df[df$type == "bid",])[1]
    # print(Bids)
    if (NbAsks > 0){
      Bids = Bids[1:min(NbAsks,NbBids),]
      potentialWelfare = sum(Bids$valuation)
      return(potentialWelfare)
    } else{
      potentialWelfare = 0
      return(potentialWelfare)
    }
  }

Unfortunately, I cannot find a way to implement this using data.table. This is as far as I have got using data.table and its FAQ:

  # trying to use data.table, but it doesn't work so far.
  potentialWelfare.tmp = orderFlow[, lapply(.SD, calcPotentialWelfare), by = list(simulationrun_id, db),.SDcols=c("simulationrun_id", "db")]

I get

  Error in `[.data.frame`(orderFlow, , lapply(.SD, calcPotentialWelfare), : unused arguments (by = list(simulationrun_id, db), .SDcols = c("simulationrun_id", "db"))

Here is the input:

  > head(orderFlow)
    type  valuation price               dateCreation                    dateDue                dateMatched id
  1  ask 0.30000000   0.3 2012-01-01 00:00:00.000000 2012-01-01 00:30:00.000000 2012-01-01 00:01:01.098307  1
  2  bid 0.39687633   0.0 2012-01-01 00:01:01.098307 2012-01-01 00:10:40.024807 2012-01-01 00:01:01.098307  2
  3  bid 0.96803384    NA 2012-01-01 00:03:05.660811 2012-01-01 00:06:26.368941                       <NA>  3
  4  bid 0.06163186    NA 2012-01-01 00:05:25.413959 2012-01-01 00:09:06.189893                       <NA>  4
  5  bid 0.57017143    NA 2012-01-01 00:10:10.344876 2012-01-01 00:57:58.998516                       <NA>  5
  6  bid 0.37188442    NA 2012-01-01 00:11:25.761372 2012-01-01 00:43:24.274176                       <NA>  6
                    created_at updated_at simulationrun_id db
  1 2013-12-10 14:37:29.065634         NA             7004  1
  2 2013-12-10 14:37:29.065674         NA             7004  1
  3 2013-12-10 14:37:29.065701         NA             7004  1
  4 2013-12-10 14:37:29.065726         NA             7004  1
  5 2013-12-10 14:37:29.065750         NA             7004  1
  6 2013-12-10 14:37:29.065775         NA             7004  1

I expect something like the following as a result. That is, calcPotentialWelfare aggregates the "valuation" column of the data.table orderFlow in a particular way.

  > head(potentialWelfare.tmp)
    simulationrun_id db potentialWelfare
  1                1  1         16.86684
  2                2  1         18.44314
  3                4  1         16.86684
  4                5  1         18.44314
  5                7  1         16.86684
  6                8  1         18.44314

Really excited to see how this can be resolved. Thanks for reading!

Edit2:

  > dput(head(orderFlow))
  structure(list(type = c("ask", "bid", "bid", "bid", "bid", "bid"
  ), valuation = c(0.3, 0.39687632952068, 0.968033835246625, 0.0616318564942726,
  0.570171430446081, 0.371884415116724), price = c(0.3, 0, NA,
  NA, NA, NA), dateCreation = c("2012-01-01 00:00:00.000000", "2012-01-01 00:01:01.098307",
  "2012-01-01 00:03:05.660811", "2012-01-01 00:05:25.413959", "2012-01-01 00:10:10.344876",
  "2012-01-01 00:11:25.761372"), dateDue = c("2012-01-01 00:30:00.000000",
  "2012-01-01 00:10:40.024807", "2012-01-01 00:06:26.368941", "2012-01-01 00:09:06.189893",
  "2012-01-01 00:57:58.998516", "2012-01-01 00:43:24.274176"),
      dateMatched = c("2012-01-01 00:01:01.098307", "2012-01-01 00:01:01.098307",
      NA, NA, NA, NA), id = 1:6, created_at = c("2013-12-10 14:37:29.065634",
      "2013-12-10 14:37:29.065674", "2013-12-10 14:37:29.065701",
      "2013-12-10 14:37:29.065726", "2013-12-10 14:37:29.065750",
      "2013-12-10 14:37:29.065775"), updated_at = c(NA_real_, NA_real_,
      NA_real_, NA_real_, NA_real_, NA_real_), simulationrun_id = c(7004L,
      7004L, 7004L, 7004L, 7004L, 7004L), db = c(1L, 1L, 1L, 1L,
      1L, 1L)), .Names = c("type", "valuation", "price", "dateCreation",
  "dateDue", "dateMatched", "id", "created_at", "updated_at", "simulationrun_id",
  "db"), row.names = c(NA, 6L), class = "data.frame")

r data.table plyr

Peter Lustig Dec 16 '13 at 21:58

1 answer

Arun · Accepted Answer · 2013-12-16T22:54:47+0000

I think this should be faster. There are a few mistakes in the way you use data.table. I suggest going through the introduction, working through the examples, and reading the FAQ.

  calcPotentialWelfare <- function(dt){
    NbAsks = nrow(dt["ask", nomatch=0L]) # binary search based subset/join - very fast
    Bids = dt["bid", nomatch=0L] # binary search based subset/join - very fast
    NbBids = nrow(Bids)
    # for each 'type', the 'valuation' will always be sorted,
    # but in ascending order - but you need descending order
    # so you can just use the function 'tail' to fetch the last 'n' items... as follows.
    if (NbAsks > 0) return(sum(tail(Bids, min(NbAsks, NbBids))$valuation))
    else return(0)
  }
  # setkey on 'type' column to use binary search based subset/join in the function
  # also on valuation so that we don't have to 'order' for every group
  # inside the function - we can use 'tail'
  setkey(orderFlow, type, valuation)
  potentialWelfare.tmp = orderFlow[, calcPotentialWelfare(.SD), by=.(simulationrun_id, db), .SDcols=c("type", "valuation")]
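The error quoted in the question is raised by `[.data.frame`, and the dput output reports class "data.frame", so orderFlow was most likely still a plain data.frame at that point. setkey, by= and .SDcols= only work on a data.table, so a conversion has to come first. A minimal sketch, using a small made-up stand-in for orderFlow (the values here are illustrative, not the real data):

```r
library(data.table)

# Hypothetical stand-in for orderFlow; the real table has more columns.
orderFlow <- data.frame(
  type             = c("ask", "bid", "bid"),
  valuation        = c(0.3, 0.4, 0.9),
  simulationrun_id = c(7004L, 7004L, 7004L),
  db               = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

is.data.table(orderFlow)   # FALSE: `[` dispatches to `[.data.frame`

setDT(orderFlow)           # converts by reference, without copying the data
is.data.table(orderFlow)   # TRUE: by= and .SDcols= are now understood

setkey(orderFlow, type, valuation)
key(orderFlow)             # "type" "valuation"
```

as.data.table(orderFlow) would also work but returns a copy, which may matter with millions of rows.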

.SD is a special variable that holds, for each group, a data.table with all columns not mentioned in by= (if .SDcols is not specified). If .SDcols is specified, .SD is built for each group with only the specified columns, containing the data for that group.
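To make the behaviour of .SD and .SDcols concrete, here is a small sketch on a toy table. The table dt and its columns g, x and y are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Toy table (made up for illustration): one grouping column and two others.
dt <- data.table(g = c("a", "a", "b"), x = 1:3, y = c(10, 20, 30))

# Without .SDcols, .SD holds every column not named in by=.
all_cols <- dt[, .(cols = paste(names(.SD), collapse = ",")), by = g]
all_cols$cols    # "x,y" "x,y"

# With .SDcols, .SD holds only the listed columns.
some_cols <- dt[, .(cols = paste(names(.SD), collapse = ",")), by = g, .SDcols = "x"]
some_cols$cols   # "x" "x"

# In both cases .SD contains just the rows of the current group.
group_rows <- dt[, .(rows = nrow(.SD)), by = g]
group_rows$rows  # 2 1
```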

Using lapply(.SD, ...) passes each column to the function separately, which is not what you want here. You need to pass the whole group's data to the function. However, since your function only requires the type and valuation columns, you can speed it up by providing .SDcols=c('type', 'valuation'). This saves a lot of time by ignoring the rest of the columns.
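The difference between the two calling styles can be seen on a toy table. The sketch below is not the code from the answer: the helper name welfare, the column run and all values are made up for illustration, and the helper uses plain vector subsetting instead of a keyed lookup so that it does not depend on any key being set.

```r
library(data.table)

# Toy data (made up): three groups with different numbers of asks and bids.
dt <- data.table(
  run       = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L),
  type      = c("ask", "bid", "bid", "bid", "ask", "ask", "bid", "bid", "bid", "bid"),
  valuation = c(0.25, 0.5, 1, 0.25, 0.5, 0.5, 0.25, 0.5, 1, 2)
)

# lapply(.SD, f) calls f once per column, so f only ever sees a single vector.
per_column <- dt[, lapply(.SD, class), by = run, .SDcols = c("type", "valuation")]
per_column$type[1]       # "character"
per_column$valuation[1]  # "numeric"

# Hypothetical helper: takes the whole group and sums the top n_asks bid valuations.
welfare <- function(sd) {
  n_asks <- sum(sd$type == "ask")
  bids   <- sort(sd$valuation[sd$type == "bid"], decreasing = TRUE)
  if (n_asks > 0) sum(head(bids, n_asks)) else 0
}

# f(.SD) passes the whole per-group table, restricted to the two needed columns.
res <- dt[, .(potentialWelfare = welfare(.SD)), by = run, .SDcols = c("type", "valuation")]
res$potentialWelfare     # 1.0 1.5 0.0
```

Group 1 has one ask, so only its best bid (1) counts. Group 2 has two asks, so its two best bids (1 and 0.5) are summed. Group 3 has no asks and yields 0.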
