R - Acceleration of the approximate date. idata.frame?

I try my best to make a "close" date match between two data frames. This question explores the solution using idata.frame from the idata.frame package, but I would be very pleased with the other solutions as well.

Here is a very simplified version of the two data frames:

 sampleticker<-data.frame(cbind(ticker=c("A","A","AA","AA"), date=c("2005-1-25","2005-03-30","2005-02-15","2005-04-21"))) sampleticker$date<-as.Date(sampleticker$date,format="%Y-%m-%d") samplereport<-data.frame(cbind(ticker=c("A","A","A","AA","AA","AA"), rdate=c("2005-2-15","2005-03-15","2005-04-15", "2005-03-01","2005-04-20","2005-05-01"))) samplereport$rdate<-as.Date(samplereport$rdate,format="%Y-%m-%d") 

In the actual sampleticker data sampleticker there are more than 30,000 rows with 40 columns and samplereport nearly 300,000 rows with 25 columns.

What I would like to do is combine the two data frames so that each row in sampleticker merged with the closest date match in samplereport that happens AFTER the date in sampleticker . I solved similar problems in the past by doing a simple merge in the ticker field, sorting in ascending order, and then choosing unique combinations of ticker and date. However, due to the size of this dataset, the merge explodes very quickly.

As far as I can tell, merge does not allow this kind of approximate match. I saw some solutions that use findInterval , but since the distance between dates will vary, I'm not sure I can specify an interval that will work for all strings.

Following another post here , I wrote the following code to use adply for each line and to make the connection:

 library(plyr) merge<-adply(sampleticker,1,function(x){ y<-subset(samplereport,ticker %in% x$ticker & rdate > x$date) y[which.min(y$rdate),] })) 

This works pretty well: for sampled data, I get below what I want.

  date ticker rdate 1 2005-01-25 A 2005-02-15 2 2005-03-30 A 2005-04-15 3 2005-02-15 AA 2005-03-01 4 2005-04-21 AA 2005-05-01 

However, since the code performs 30,000 subset operations, it is very slow: I completed the above request more than a day before it finally killed it.

I see here that plyr 1.0 has an idata.frame structure that calls a data block by reference, greatly speeding up the operation of the subset. However, I cannot get the following code to work:

 isamplereport<-idata.frame(samplereport) adply(sampleticker,1,function(x){ y<-subset(isamplereport,isamplereport$ticker %in% x$ticker & isamplereport$rdate > x$date) y[which.min(y$rdate),] }) 

I get an error

 Error in list_to_dataframe(res, attr(.data, "split_labels")) : Results must be all atomic, or all data frames 

This makes sense to me, as the operation returns idata.frame (I assume). However, changing the last line to:

 as.data.frame(y[which.min(y$rdate),]) 

also causes an error:

 Error in `[.data.frame`(x$`_data`, x$`_rows`, x$`_cols`) : undefined columns selected. 

Note that calling as.data.frame in the plain old samplereport returns the original data frame as expected.

I know idata.frame is experimental, so I did not necessarily expect it to work correctly. However, if anyone has an idea how to fix this, I would appreciate it. Alternatively, if someone can offer a completely different approach that works more efficiently, that would be fantastic.

Matt

UPDATE Data.table is the right way to get around this. See below.

+7
source share
3 answers

Thanks to Matthew Dole and his ability to roll back as well as forward in data.table, it is now much easier to perform this merge.

 ST <- data.table(sampleticker) SR <- data.table(samplereport) setkey(ST,ticker,date) SR[,mergerdate:=rdate] setkey(SR,ticker,mergerdate) merge<-SR[ST,roll=-Inf] setnames(merge,"mergerdate","date") # ticker date rdate # 1: A 2005-01-25 2005-02-15 # 2: A 2005-03-30 2005-04-15 # 3: AA 2005-02-15 2005-03-01 # 4: AA 2005-04-21 2005-05-01 
+8
source

Here is the data.table solution, which will probably work better than what you are currently using:

 library(data.table) ST <- data.table(sampleticker, key="ticker") SR <- data.table(samplereport, key="ticker") SR <- SR[with(SR, order(ticker, rdate)),] # rdates need to be in increasing order SR[ST, list(date = date, rdate = rdate[match(TRUE, (rdate > date))]), ] ticker date rdate [1,] A 2005-01-25 2005-02-15 [2,] A 2005-03-30 2005-04-15 [3,] AA 2005-02-15 2005-03-01 [4,] AA 2005-04-21 2005-05-01 

Of course, it looks like you really want to combine two much wider data elements. To demonstrate one way to achieve this, in the example below, I add several columns to both data.tables tables, and then show how you can combine the corresponding rows:

 # Add some columns to both data.tables ST$alpha <- letters[seq_len(nrow(ST))] SR$n <- seq_len(nrow(SR)) SR$ALPHA <- LETTERS[seq_len(nrow(SR))] # Perform a merge that includes the whole rows from samplereport # corresponding to the selected rdate RES <- SR[ST, cbind(date, .SD[match(TRUE,(rdate>date)),-1]), ] # Merge res (containing the selected rows from samplereport) back together # with sampleticker keycols <- c("ticker", "date") setkeyv(RES, keycols) setkeyv(ST, keycols) ST[RES] # ticker date alpha rdate n ALPHA # [1,] A 2005-01-25 a 2005-02-15 1 A # [2,] A 2005-03-30 b 2005-04-15 3 C # [3,] AA 2005-02-15 c 2005-03-01 4 D # [4,] AA 2005-04-21 d 2005-05-01 6 F 
+6
source

Here is the solution that follows Matthew Dole’s observation that this is the natural place to apply the data.table roll=TRUE argument.

If you want to apply it, there is one wrinkle that needs to be ironed. roll=TRUE designed so that if no exact match is found for the last column of the key (here is the date), the value from the nearest previous date will be moved forward. However, you want the opposite (and even if there is an exact match, you still need the value from the next available date).

The first attempt may be sorting by "ticker" , and by "rdate" in reverse order, merging with the resulting ordered SR . This would work, except that data.table does not want to let you sort in the reverse order: keying by "rdate" forces this column in ascending order. ( data.table needs to do this in order to implement the quick negotiation and connection for which it was designed).

My solution below is to create a new column - "rnd" for "reverse numeric date" - in both data.tables whose values ​​are generated using -as.numeric(date) . This assigns a unique value to each date. Moreover, since the values ​​were multiplied by -1 , sorting them in ascending order affects the sort dates in descending order .

(Another detail: because you do not want exact matches, and instead always want the next date after the current one, I subtracted 1 from sampleticker rnd , which has the desired effect. That it does its job correctly, I slightly edited your example data to include one possible exact match ( "2005-1-25" ), which should not be chosen by merging).

 # Create sample data.tables library(data.table) ST <- data.table(ticker = c("A","A","AA","AA"), date = as.Date(c("2005-1-25","2005-03-30","2005-02-15", "2005-04-21"), format="%Y-%m-%d"), alpha = letters[1:4]) SR <- data.table(ticker = c("A","A","A","AA","AA","AA"), rdate = as.Date(c("2005-1-25","2005-03-15","2005-04-15", "2005-03-01","2005-04-20","2005-05-01"), format="%Y-%m-%d"), ALPHA = LETTERS[1:6]) 

With the sample data in hand, configure and perform the required merge:

 # Create a "reverse numerical date" column, which will uniquely # identify date, and allow them to be sorted in reverse temporal order ST$rnd <- -(as.numeric(ST$date) + 1) SR$rnd <- -(as.numeric(SR$rdate)) # key (and thus sort) both data.tables by ticker and "reverse numerical date" keycols <- c("ticker", "rnd") setkeyv(ST, keycols) setkeyv(SR, keycols) # The syntax of the merge is now as simple as can be res <- SR[ST, roll=TRUE] # Finally, put the results back in temporal order, and pretty up the column order setkeyv(res, c("ticker", "date")) setcolorder(res, c("ticker", "date", "rdate", "alpha", "ALPHA", "rnd")) res # ticker date rdate alpha ALPHA rnd # [1,] A 2005-01-25 2005-03-15 a B -12809 # [2,] A 2005-03-30 2005-04-15 b C -12873 # [3,] AA 2005-02-15 2005-03-01 c D -12830 # [4,] AA 2005-04-21 2005-05-01 d F -12895 
+4
source

All Articles