Here is a start using data.table:
billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35",
sep=" ", header=TRUE, row.names=1)
require(data.table)
DT = as.data.table(billdata)
First, change the type of the begin and end columns to dates. Unlike with a data.frame, := updates the columns by reference, so this does not copy the entire data set.
DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
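To illustrate the no-copy point, here is a minimal sketch on a made-up table (`toy` and its values are hypothetical, not part of the question's data). It uses data.table's `address()` helper to check that the table is still the same object after `:=`.

```r
library(data.table)

# Hypothetical toy table; the dates start out as character strings
toy <- data.table(begin = c("2009-10-06", "2009-11-04"))

addr_before <- address(toy)       # memory address of the table object
toy[, begin := as.Date(begin)]    # convert the column by reference
addr_after <- address(toy)

# The table object was not copied, and the column is now a Date
same_object <- identical(addr_before, addr_after)
is_date <- inherits(toy$begin, "Date")
```

If both `same_object` and `is_date` come out TRUE, the conversion happened in place rather than on a copy of the table.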
Then find the time span, look up the prevailing bill for each day, and aggregate by month.
alldays = DT[,seq(min(begin),max(end),by="day")]
setkey(DT, acct, begin)
DT[CJ(unique(acct),alldays),
   mean(amount/days,na.rm=TRUE),
   by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

    acct   month        V1
 1: 2242 2009-10 391.34483
 2: 2242 2009-11 406.69448
 3: 2242 2009-12 601.43226
 4: 2242 2010-01 646.27465
 5: 2242 2010-02 653.32143
 6: 2243 2009-10 938.51724
 7: 2243 2009-11  97.36172
 8: 2243 2009-12 375.68065
 9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429
I think you will find that the prevailing join logic is rather cumbersome in SQL and slower.
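If the prevailing (rolling) join is unfamiliar, here is a minimal sketch of it on its own. The tables `bills` and `days` and their values are invented for illustration. With `roll=TRUE`, each day picks up the most recent bill that began on or before it.

```r
library(data.table)

# Hypothetical bills: one account, two billing periods with different rates
bills <- data.table(
  acct  = c(1L, 1L),
  begin = as.Date(c("2020-01-01", "2020-01-11")),
  rate  = c(10, 20)
)
setkey(bills, acct, begin)

# Days to look up; none of them needs to match a begin date exactly
days <- data.table(
  acct  = 1L,
  begin = as.Date(c("2020-01-05", "2020-01-11", "2020-01-20"))
)

# Rolling join: the last bill at or before each day is carried forward
prevailing <- bills[days, roll = TRUE]
```

Here `prevailing$rate` should be 10, 20, 20: the first day rolls back to the 2020-01-01 bill, the second matches exactly, and the third carries the last bill forward.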
I say this is only a start because it is not quite right. Note that row 10 repeats the value of row 9, because account 2243's bills do not extend into 2010-02, unlike account 2242's. To finish it off, you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays per account rather than across all accounts.
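As far as I know, later data.table releases replaced `rolltolast` with the `rollends` argument, so treat the following as a sketch under that assumption rather than something tested on 1.8.x. The tables are again made up. With `rollends=FALSE`, the last bill is not carried forward past its own start, so days after it come back as NA instead of repeating the last value.

```r
library(data.table)

# Hypothetical bills and lookup days (same shape as the real data)
bills <- data.table(
  acct  = c(1L, 1L),
  begin = as.Date(c("2020-01-01", "2020-01-11")),
  rate  = c(10, 20)
)
setkey(bills, acct, begin)

days <- data.table(
  acct  = 1L,
  begin = as.Date(c("2020-01-05", "2020-01-11", "2020-01-20"))
)

# Roll within the observed range only; do not extend the last row forward
capped <- bills[days, roll = TRUE, rollends = FALSE]
```

The first two days still get rates 10 and 20, while the day after the last bill gets NA. That is why the suggestion above adds a final row per account: it marks where the last bill ends.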
See whether the speed of the above is acceptable, and we can go from there.
You will probably hit a bug in 1.8.2 that was fixed in 1.8.3. I am using v1.8.3. From the NEWS:
"Internal" error message when combining a join containing missing groups and group by is fixed, #2162. For example: X[Y, .N, by=NonJoinColumn] where Y contains some rows that do not match X. This bug could also result in a seg fault.
Let me know, and we can either work around it or you can upgrade to 1.8.3 from R-Forge.
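For reference, the pattern named in that NEWS item looks roughly like the sketch below. `X`, `Y`, and the column names are hypothetical. On a fixed version I would expect the unmatched row to be counted under an NA group rather than trigger an error.

```r
library(data.table)

# Hypothetical keyed table with a non-join column 'grp'
X <- data.table(id = 1:3, grp = c("a", "a", "b"), key = "id")

# Lookup table where id 4 has no match in X
Y <- data.table(id = c(1L, 2L, 4L))

# Join on id, then group by a column that is not part of the join
counts <- X[Y, .N, by = grp]
```

If this call errors or crashes on your installation, you are most likely on an affected version.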
Btw, good example data. This expedited the response.
Here is the complete answer alluded to above. It is a bit tricky, I have to admit, since it combines several features of data.table. This should work in 1.8.2, as it happens, but I have only tested it in 1.8.3.
DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
    mean(amount/days,na.rm=TRUE),
    by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

   acct   month        V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11  97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429