Using a SQL LEFT JOIN with sqldf:
    library(sqldf)
    sqldf('SELECT df2.id , df1.value
           FROM df2 LEFT JOIN df1
           ON df2.id = df1.id')

      id     value
    1  1  1.000000
    2  2  6.210526
    3  3 11.421053
    4  4 16.631579
    5  5 21.842105
    6 21        NA
    7 22        NA
    8 23        NA
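For comparison, the same left join can be sketched in base R with match() and merge(), the two approaches benchmarked further down. The OP's data frames are not shown in this answer, so the df1 and df2 below are an assumption, chosen only because they reproduce the output above.

```r
# Assumed stand-in for the OP's data (not taken from the question itself);
# it reproduces the sqldf output shown above.
df1 <- data.frame(id = 1:20, value = seq(1, 100, length.out = 20))
df2 <- data.frame(id = c(1:5, 21:23), v2 = NA)

# Left join via match(): ids of df2 that are absent from df1 give NA.
by_match <- data.frame(id    = df2$id,
                       value = df1$value[match(df2$id, df1$id)])

# Left join via merge(): all.x = TRUE keeps every row of df2.
by_merge <- merge(df2["id"], df1, by = "id", all.x = TRUE)

print(by_match)
```

Both by_match and by_merge keep all eight ids of df2, with NA for ids 21 to 23, which have no match in df1.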
EDIT: adding some benchmarks:
Matching, as expected, is very fast. sqldf is very slow!
Test with the OP's data:
    library(microbenchmark)
    microbenchmark(ag(),ar.dt(),ar.me(),tl())

    Unit: microseconds
         expr       min         lq     median        uq       max
    1    ag() 23071.953 23536.1680 24053.8590 26889.023 34256.354
    2 ar.dt()  3123.972  3284.5890  3348.1155  3523.333  7740.335
    3 ar.me()   950.807  1015.2815  1095.1160  1128.112  6330.243
    4    tl()    41.340    45.8915    68.0785    71.112   187.735
Test with big data: 1e6 rows.

Here's how I generate my data:
    N <- 1e6
    df1 <- data.frame(id=as.character(1:N), value=seq(1, 100), stringsAsFactors=F)
    n2 <- 1000
    df2 <- data.frame(id=sample(df1$id,n2), v2=NA, stringsAsFactors=F)
Surprise!! merge is about 12 times faster than sqldf (by median time), and the data.table solution is the slowest!
    Unit: milliseconds
         expr       min        lq    median        uq        max
    1    ag() 5678.0580 5865.3063 6034.9151 6214.3664  8084.6294
    2 ar.dt() 8373.6083 8612.9496 8867.6164 9104.7913 10423.5247
    3 ar.me()  387.4665  451.0071  506.8269  648.3958  1014.3099
    4    tl()  174.0375  186.8335  214.0468  252.9383   667.6246
The functions ag, ar.dt, ar.me, and tl are defined as follows:
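    # sqldf: SQL LEFT JOIN
    ag <- function(){
      require(sqldf)
      sqldf('SELECT df2.id , df1.value
             FROM df2 LEFT JOIN df1
             ON df2.id = df1.id')
    }

    # data.table: builds the keyed table on every call, then looks up by key
    ar.dt <- function(){
      require(data.table)
      dt1 <- data.table(df1, key="id")
      dt2 <- data.table(df2)
      dt1[dt2$id, value]
    }

    # base R merge: all.x=T keeps every row of df2 (left join)
    ar.me <- function(){
      merge(df2, df1, by="id", all.x=T, sort=F)
    }

    # base R match: position of each df2 id in df1, NA when absent
    tl <- function(){
      df2Needed <- df2
      df2Needed$v2 <- df1$value[match(df2$id, df1$id)]
    }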
EDIT 2
It seems that including the creation of the data.table in the timing is a bit unfair to it. To avoid confusion, I am adding a new function that assumes the data.table structures already exist.
    ar.dtLight <- function(){
      dt1[dt2$id, value]
    }
    library(microbenchmark)
    microbenchmark(ag(),ar.dt(),ar.me(),tl(),ar.dtLight,times=1)

    Unit: microseconds
            expr         min          lq      median          uq         max
    1       ag() 7247593.591 7247593.591 7247593.591 7247593.591 7247593.591
    2    ar.dt() 8543556.967 8543556.967 8543556.967 8543556.967 8543556.967
    3 ar.dtLight       1.139       1.139       1.139       1.139       1.139
    4    ar.me()  462235.106  462235.106  462235.106  462235.106  462235.106
    5       tl()  201988.996  201988.996  201988.996  201988.996  201988.996
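It seems that creating keys (indexes) is time-consuming, and that after creating the indexes the data.table method is second to none. Note, however, that the benchmark call above passes ar.dtLight without parentheses, so the 1.139 microsecond figure measures only the evaluation of the function's name, not the keyed lookup itself. The comparison needs to be rerun with ar.dtLight() to confirm that conclusion.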
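A minimal sketch of how the one-off key creation could be timed separately from the lookup, assuming the data.table package is installed. The names lookup, t_key, and t_lookup are illustrative and not part of the benchmark above, and N is reduced to 1e5 here only to keep the sketch quick to run.

```r
library(data.table)

set.seed(1)
N  <- 1e5   # assumption: smaller than the 1e6 rows used above, for speed
n2 <- 1000
df1 <- data.frame(id = as.character(1:N), value = seq(1, 100),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = sample(df1$id, n2), v2 = NA,
                  stringsAsFactors = FALSE)

# One-off cost: build the keyed table (this sorts dt1 by id).
t_key <- system.time(dt1 <- data.table(df1, key = "id"))

# Repeated cost: keyed lookup against the already-built table.
# Note that the function is actually called here, with parentheses.
lookup <- function() dt1[.(df2$id), value]
t_lookup <- system.time(res_dt <- lookup())

# Sanity check: the keyed lookup agrees with the match() approach.
res_match <- df1$value[match(df2$id, df1$id)]
stopifnot(isTRUE(all.equal(res_dt, res_match)))

print(t_key["elapsed"])
print(t_lookup["elapsed"])
```

Timings will vary by machine, so none are quoted here. The point of the split is that the key is built once, while the lookup is the part that repeats.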