I am processing a data.table called orderFlow and compute potentialWelfare.tmp from it. So far my approach has been based on plyr, but because the input orderFlow has millions of rows, I would prefer a solution that takes advantage of the performance of data.table in R.
    # solution so far, poor performance on huge orderFlow input data.table
    require(plyr)
    potentialWelfare.tmp = ddply(orderFlow,
                                 .variables = c("simulationrun_id", "db"),
                                 .fun = calcPotentialWelfare,
                                 .progress = "text",
                                 .parallel = TRUE)
Edit1: In short, the user-defined function checks whether there are more bids or asks in df and sums the valuations of the top NbAsks bids, sorted by valuation in decreasing order. This selects the most valuable bids and sums their valuations. The code is old and perhaps not very efficient, but it served its purpose in combination with plyr and regular data.frames.
    calcPotentialWelfare <- function(df){
      NbAsks = dim(df[df$type=="ask",])[1]
      # print(NbAsks)
      Bids = df[df$type == "bid",]
      # dd[with(dd, order(-z, b)), ]
      Bids = Bids[with(Bids,order(valuation,decreasing = TRUE)),]
      NbBids = dim(df[df$type == "bid",])[1]
      # print(Bids)
      if (NbAsks > 0){
        Bids = Bids[1:min(NbAsks,NbBids),]
        potentialWelfare = sum(Bids$valuation)
        return(potentialWelfare)
      } else{
        potentialWelfare = 0
        return(potentialWelfare)
      }
    }
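To make the intended rule concrete, here is a minimal sketch of the same aggregation on plain vectors, run on the six sample rows. The helper name `topBidSum` is hypothetical (it is not part of my original code), and the valuations are hard-coded from the sample input.

```r
# Hypothetical helper: sum of the k highest bid valuations,
# where k = min(number of asks, number of bids).
topBidSum <- function(type, valuation) {
  bids <- sort(valuation[type == "bid"], decreasing = TRUE)
  k <- min(sum(type == "ask"), length(bids))
  sum(head(bids, k))  # head(x, 0) is empty, so the result is 0 when k == 0
}

# Values copied from the sample rows of orderFlow
type <- c("ask", "bid", "bid", "bid", "bid", "bid")
valuation <- c(0.30, 0.39687633, 0.96803384, 0.06163186, 0.57017143, 0.37188442)

# One ask and five bids: only the single highest bid is counted
result <- topBidSum(type, valuation)
print(result)
```

With one ask in the sample, only the highest bid (0.96803384) contributes. Unlike the `1:min(NbAsks, NbBids)` indexing, `head(bids, k)` also copes with a group that has asks but no bids.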
Unfortunately, I cannot find a way to implement this using data.table. My attempt so far, based on the data.table documentation and FAQs, applies calcPotentialWelfare via lapply(.SD, ...) grouped by simulationrun_id and db. When I run it, I get:
Error in `[.data.frame`(orderFlow, , lapply(.SD, calcPotentialWelfare), : unused arguments (by = list(simulationrun_id, db), .SDcols = c("simulationrun_id", "db"))
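The error mentions `[.data.frame`, which suggests that orderFlow is still a plain data.frame at this point, so `by` and `.SDcols` are not understood. A minimal sketch of one possible direction, assuming the data.table package is installed: convert first, then compute the aggregate per group in `j`. The toy data below is made up for illustration and only mimics the relevant columns; `orderFlowDT` is a name introduced here.

```r
library(data.table)

# Toy stand-in for orderFlow (made-up values, two groups)
orderFlow <- data.frame(
  type = c("ask", "bid", "bid", "ask", "ask", "bid"),
  valuation = c(0.3, 0.4, 0.9, 0.2, 0.1, 0.5),
  simulationrun_id = c(1L, 1L, 1L, 2L, 2L, 2L),
  db = 1L
)

# as.data.table() makes a copy; setDT(orderFlow) would convert in place
orderFlowDT <- as.data.table(orderFlow)

# Per group: sum the k highest bid valuations, k = min(#asks, #bids)
potentialWelfare.tmp <- orderFlowDT[, {
  bids <- sort(valuation[type == "bid"], decreasing = TRUE)
  k <- min(sum(type == "ask"), length(bids))
  list(potentialWelfare = sum(head(bids, k)))
}, by = list(simulationrun_id, db)]

print(potentialWelfare.tmp)
```

Working on the `type` and `valuation` vectors directly avoids building a sub-table for each group. Calling the existing function as `calcPotentialWelfare(.SD)` in `j` should also work after the conversion, but I would expect it to be slower on millions of rows.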
Here is the input:
    > head(orderFlow)
      type  valuation price               dateCreation                    dateDue                dateMatched id
    1  ask 0.30000000   0.3 2012-01-01 00:00:00.000000 2012-01-01 00:30:00.000000 2012-01-01 00:01:01.098307  1
    2  bid 0.39687633   0.0 2012-01-01 00:01:01.098307 2012-01-01 00:10:40.024807 2012-01-01 00:01:01.098307  2
    3  bid 0.96803384    NA 2012-01-01 00:03:05.660811 2012-01-01 00:06:26.368941                       <NA>  3
    4  bid 0.06163186    NA 2012-01-01 00:05:25.413959 2012-01-01 00:09:06.189893                       <NA>  4
    5  bid 0.57017143    NA 2012-01-01 00:10:10.344876 2012-01-01 00:57:58.998516                       <NA>  5
    6  bid 0.37188442    NA 2012-01-01 00:11:25.761372 2012-01-01 00:43:24.274176                       <NA>  6
                      created_at updated_at simulationrun_id db
    1 2013-12-10 14:37:29.065634         NA             7004  1
    2 2013-12-10 14:37:29.065674         NA             7004  1
    3 2013-12-10 14:37:29.065701         NA             7004  1
    4 2013-12-10 14:37:29.065726         NA             7004  1
    5 2013-12-10 14:37:29.065750         NA             7004  1
    6 2013-12-10 14:37:29.065775         NA             7004  1
I expect something like the following as a result. In other words, calcPotentialWelfare aggregates the valuation column of the data.table orderFlow in a specific way, once per group.
    > head(potentialWelfare.tmp)
      simulationrun_id db potentialWelfare
    1                1  1         16.86684
    2                2  1         18.44314
    3                4  1         16.86684
    4                5  1         18.44314
    5                7  1         16.86684
    6                8  1         18.44314
Really excited to see how this can be resolved. Thanks for reading!
Edit2:
    > dput(head(orderFlow))
    structure(list(
      type = c("ask", "bid", "bid", "bid", "bid", "bid"),
      valuation = c(0.3, 0.39687632952068, 0.968033835246625,
                    0.0616318564942726, 0.570171430446081, 0.371884415116724),
      price = c(0.3, 0, NA, NA, NA, NA),
      dateCreation = c("2012-01-01 00:00:00.000000", "2012-01-01 00:01:01.098307",
                       "2012-01-01 00:03:05.660811", "2012-01-01 00:05:25.413959",
                       "2012-01-01 00:10:10.344876", "2012-01-01 00:11:25.761372"),
      dateDue = c("2012-01-01 00:30:00.000000", "2012-01-01 00:10:40.024807",
                  "2012-01-01 00:06:26.368941", "2012-01-01 00:09:06.189893",
                  "2012-01-01 00:57:58.998516", "2012-01-01 00:43:24.274176"),
      dateMatched = c("2012-01-01 00:01:01.098307", "2012-01-01 00:01:01.098307",
                      NA, NA, NA, NA),
      id = 1:6,
      created_at = c("2013-12-10 14:37:29.065634", "2013-12-10 14:37:29.065674",
                     "2013-12-10 14:37:29.065701", "2013-12-10 14:37:29.065726",
                     "2013-12-10 14:37:29.065750", "2013-12-10 14:37:29.065775"),
      updated_at = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
      simulationrun_id = c(7004L, 7004L, 7004L, 7004L, 7004L, 7004L),
      db = c(1L, 1L, 1L, 1L, 1L, 1L)),
      .Names = c("type", "valuation", "price", "dateCreation", "dateDue",
                 "dateMatched", "id", "created_at", "updated_at",
                 "simulationrun_id", "db"),
      row.names = c(NA, 6L), class = "data.frame")