I am trying to do a "closest" date match between two data frames. This question explores a solution using idata.frame from the plyr package, but I would be very happy with other solutions as well.
Here is a very simplified version of the two data frames:
    sampleticker <- data.frame(cbind(ticker = c("A","A","AA","AA"),
                                     date = c("2005-1-25","2005-03-30","2005-02-15","2005-04-21")))
    sampleticker$date <- as.Date(sampleticker$date, format = "%Y-%m-%d")

    samplereport <- data.frame(cbind(ticker = c("A","A","A","AA","AA","AA"),
                                     rdate = c("2005-2-15","2005-03-15","2005-04-15",
                                               "2005-03-01","2005-04-20","2005-05-01")))
    samplereport$rdate <- as.Date(samplereport$rdate, format = "%Y-%m-%d")

In the actual data, sampleticker has more than 30,000 rows with 40 columns, and samplereport has nearly 300,000 rows with 25 columns.
What I would like to do is combine the two data frames so that each row in sampleticker is merged with the closest date match in samplereport that falls AFTER the date in sampleticker. I have solved similar problems in the past by doing a simple merge on the ticker field, sorting in ascending order, and then keeping the unique combinations of ticker and date. However, due to the size of this dataset, the merge blows up very quickly.
As far as I can tell, merge does not allow this kind of approximate match. I have seen some solutions that use findInterval, but since the distance between dates varies, I'm not sure I can specify an interval that will work for all rows.
Following another post here, I wrote the following code to use adply on each row and perform the join:
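For what it's worth, findInterval does not need a fixed interval width: it only needs the report dates for a ticker to be sorted, and it returns how many of them fall at or before each query date. The following is a minimal base R sketch under that assumption, not a tested solution for the full data. The helper name next_report_date is hypothetical, and the sample data is rebuilt here with explicit column types so the block runs on its own.

```r
# Sample data from the question, rebuilt with explicit column types.
sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")),
  stringsAsFactors = FALSE
)
samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each date, the first report date strictly after it.
next_report_date <- function(dates, rdates) {
  rdates <- sort(rdates)
  # findInterval() counts how many rdates are <= each date, so adding 1
  # points at the first rdate strictly greater than that date.
  idx <- findInterval(as.numeric(dates), as.numeric(rdates)) + 1L
  rdates[idx]  # NA when no later report exists
}

# Split the report dates by ticker once, instead of subsetting per row.
reports_by_ticker <- split(samplereport$rdate, samplereport$ticker)
sampleticker$rdate <- as.Date(NA)
for (tk in unique(sampleticker$ticker)) {
  rows <- sampleticker$ticker == tk
  rd <- reports_by_ticker[[tk]]
  if (!is.null(rd)) {
    sampleticker$rdate[rows] <- next_report_date(sampleticker$date[rows], rd)
  }
}
```

This only attaches the matched rdate. The remaining samplereport columns could then be pulled in with an ordinary exact merge on ticker and rdate, which should stay small because each sampleticker row has at most one match.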
    library(plyr)
    # For each row of sampleticker, keep the earliest later report for the same ticker.
    merged <- adply(sampleticker, 1, function(x) {
      y <- subset(samplereport, ticker %in% x$ticker & rdate > x$date)
      y[which.min(y$rdate), ]
    })

This works just fine: for the sample data, I get the output below, which is what I want.

            date ticker      rdate
    1 2005-01-25      A 2005-02-15
    2 2005-03-30      A 2005-04-15
    3 2005-02-15     AA 2005-03-01
    4 2005-04-21     AA 2005-05-01

However, since the code performs 30,000 subset operations, it is very slow: I ran the above query for more than a day before finally killing it.
I see here that plyr 1.0 has an idata.frame structure that accesses the data frame by reference, which greatly speeds up subsetting. However, I cannot get the following code to work:

    isamplereport <- idata.frame(samplereport)
    adply(sampleticker, 1, function(x) {
      y <- subset(isamplereport, isamplereport$ticker %in% x$ticker & isamplereport$rdate > x$date)
      y[which.min(y$rdate), ]
    })
I get the error:
Error in list_to_dataframe(res, attr(.data, "split_labels")) : Results must be all atomic, or all data frames
This makes sense to me, as the operation returns an idata.frame (I assume). However, changing the last line to:

    as.data.frame(y[which.min(y$rdate),])
also causes an error:
Error in `[.data.frame`(x$`_data`, x$`_rows`, x$`_cols`) : undefined columns selected.
Note that calling as.data.frame on the plain old samplereport returns the original data frame, as expected.
I know idata.frame is experimental, so I did not necessarily expect it to work correctly. However, if anyone has an idea how to fix this, I would appreciate it. Alternatively, if someone can offer a completely different approach that works more efficiently, that would be fantastic.
Matt
UPDATE: data.table is the right way to go about this. See below.
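As a rough illustration of what a data.table approach can look like, here is a minimal sketch using a rolling join on the sample data. It is an assumption about one way to set this up, not necessarily the exact code from the answers. The names ticker_dt, report_dt, and join_date are made up for this example. Because a rolling join also accepts an exact date match, the sketch joins on the day after each ticker date so that only strictly later reports qualify.

```r
library(data.table)

# Sample data from the question as data.tables (names are illustrative).
ticker_dt <- data.table(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21"))
)
report_dt <- data.table(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01"))
)

# Hypothetical helper column: join on the day after the ticker date so a
# report dated the same day as the ticker row is not treated as a match.
ticker_dt[, join_date := date + 1]
report_dt[, join_date := rdate]

# roll = -Inf carries the next observation backwards, so each ticker row
# picks up the first report at or after join_date within the same ticker.
result <- report_dt[ticker_dt, on = .(ticker, join_date), roll = -Inf]

# Drop the helper column; rows with no later report keep rdate = NA.
result[, join_date := NULL]
```

The result has one row per row of ticker_dt, in the same order, with the matched rdate alongside the original date.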