Speeding up data frame matching

I have two data frames, something like this:

    data = data.frame(data=cbind(1:12,rep(c(1,2),6),rep(c(1,2,3),4)))
    colnames(data)=c('v','h','c')
    lookup = data.frame(data=cbind(c(rep(1,3),rep(2,3)),rep(c(1,2,3),2),21:26))
    colnames(lookup)=c('h','c','t')

I want to subtract lookup$t from data$v, where the columns h and c correspond.

I thought something like this would work:

    data$v-lookup$t[lookup$h==data$h&lookup$c==data$c]

but R doesn't magically know that I want it to iterate over the rows of data implicitly.
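To see why the one-liner misfires, it can help to look at what the comparison actually computes. The sketch below rebuilds the toy data with named columns (an equivalent construction, not the exact code above) and inspects the intermediate results; the names `hits` and `picked` are made up for illustration.

```r
# Same toy data as in the question, built with named columns directly
data <- data.frame(v = 1:12, h = rep(c(1, 2), 6), c = rep(c(1, 2, 3), 4))
lookup <- data.frame(h = rep(c(1, 2), each = 3), c = rep(c(1, 2, 3), 2), t = 21:26)

# lookup has 6 rows and data has 12, so lookup$h and lookup$c are recycled
# and compared position by position with data$h and data$c.
# No search of lookup is done for each row of data.
hits <- lookup$h == data$h & lookup$c == data$c
length(hits)   # one TRUE/FALSE per row of data, not per row of lookup

# Indexing the 6-element lookup$t with a 12-element logical vector
# yields NA for every TRUE position past the end of lookup$t.
picked <- lookup$t[hits]
picked
```

The result has a different length from `data$v`, so the subtraction that follows either recycles again or warns, rather than matching rows on `h` and `c`.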

I ended up doing this:

    myt = c()
    for(i in 1:12) {
        myt[i] = lookup$t[lookup$h==data$h[i]&lookup$c==data$c[i]]
    }

which works fine, but I hope someone can suggest a more reasonable way that doesn't include a loop.

4 answers

It looks like you can merge and then do the math:

    dataLookedUp <- merge(data, lookup)
    dataLookedUp$newValue <- with(dataLookedUp, v - t)

Is merging and calculation faster for your real data?
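One way to check is to run both versions on the same input, confirm they agree, and time them. The following is a minimal sketch under a few assumptions: the function names `loop_diff` and `merge_diff` and the helper column `id` are invented for illustration, and the enlarged data set `big` is just the toy data repeated. Note that merge sorts its result by the key columns, so the sketch uses `id` to restore the original row order before comparing.

```r
data <- data.frame(v = 1:12, h = rep(c(1, 2), 6), c = rep(c(1, 2, 3), 4))
lookup <- data.frame(h = rep(c(1, 2), each = 3), c = rep(c(1, 2, 3), 2), t = 21:26)

# Loop version, as in the question
loop_diff <- function(data, lookup) {
  myt <- numeric(nrow(data))
  for (i in seq_len(nrow(data))) {
    myt[i] <- lookup$t[lookup$h == data$h[i] & lookup$c == data$c[i]]
  }
  data$v - myt
}

# Merge version; id is a hypothetical helper column used only to
# put the merged rows back into the original order of data
merge_diff <- function(data, lookup) {
  data$id <- seq_len(nrow(data))
  m <- merge(data, lookup, by = c("h", "c"))
  m <- m[order(m$id), ]
  m$v - m$t
}

# Both approaches should give the same differences
stopifnot(isTRUE(all.equal(loop_diff(data, lookup), merge_diff(data, lookup))))

# Rough timing on a larger input (the toy data repeated 500 times)
big <- data[rep(seq_len(nrow(data)), 500), ]
system.time(loop_diff(big, lookup))
system.time(merge_diff(big, lookup))
```

Timings depend on the machine and on the real data, so this only shows how to measure, not which approach wins.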

If data and/or lookup is really big, you can use data.table to create an index (a key) on h and c before merging to speed it up.
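For comparison, base R can also do this kind of lookup without merge or an extra package by matching on a combined key. This is a different technique from the merge above, sketched here as one possibility rather than as part of the original answer; the names `data_key`, `lookup_key`, and `idx` and the "|" separator are arbitrary choices, and the approach assumes each (h, c) pair occurs exactly once in lookup.

```r
data <- data.frame(v = 1:12, h = rep(c(1, 2), 6), c = rep(c(1, 2, 3), 4))
lookup <- data.frame(h = rep(c(1, 2), each = 3), c = rep(c(1, 2, 3), 2), t = 21:26)

# Build one combined key per row from the two matching columns
data_key <- paste(data$h, data$c, sep = "|")
lookup_key <- paste(lookup$h, lookup$c, sep = "|")

# match() gives, for each row of data, the position of its key in lookup
# (NA if the key is absent)
idx <- match(data_key, lookup_key)

# Vectorised subtraction; the rows of data keep their original order
data$newValue <- data$v - lookup$t[idx]
data
```

Unlike merge, this keeps data in its original row order, which may matter if later steps rely on it.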


An alternative that is (1) more familiar to those who are used to SQL queries and (2) often faster than a standard merge is to use sqldf. (Note that on Mac OS X you probably want to install Tcl/Tk, which sqldf depends on.) As an added bonus, sqldf will by default convert strings to factors.

    install.packages("sqldf")
    library(sqldf)
    data <- data.frame(v = 1:12,
                       h = rep(c("one", "two"), 6),
                       c = rep(c("one", "two", "three"), 4))
    lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)),
                         c = rep(c("one", "two", "three"), 2),
                         t = 21:26)
    soln <- sqldf("select * from data inner join lookup using (h, c)")
    soln <- transform(soln, v.minus.t = v - t)

With integer columns, I don't think you can improve on JD's suggestion above, but if the columns you merge on contained strings, you could convert them to factors with as.factor, which could speed up the merge, depending on the size of your dataset and the number of merges/sorts you expect:

    data <- data.frame(v = 1:12,
                       h = rep(c("one", "two"), 6),
                       c = rep(c("one", "two", "three"), 4))
    lookup <- data.frame(h = c(rep("one", 3), rep("two", 3)),
                         c = rep(c("one", "two", "three"), 2),
                         t = 21:26)
    data <- transform(data, h = as.factor(h), c = as.factor(c))
    lookup <- transform(lookup, h = as.factor(h), c = as.factor(c))
    temp <- merge(data, lookup)
    temp <- transform(temp, v.minus.t = v - t)

This is a perfect fit for data.table, using a keyed join:

    library(data.table)
    data <- as.data.table(data)
    lookup <- as.data.table(lookup)
    setkey(data, h, c)
    setkey(lookup, h, c)
    data[lookup, list(v, t, newValue = v - t)]