The question asks for a faster way to subset the rows of a data frame. The fastest way is with data.table.
    set.seed(1)    # for reproducible example
    # 1 million rows - big enough?
    df <- data.frame(age=sample(1:65,1e6,replace=TRUE),x=rnorm(1e6),y=rpois(1e6,25))

    library(microbenchmark)
    microbenchmark(result<-df[which(df$age>5),],
                   result<-subset(df, age>5),
                   result<-df[df$age>5,],
                   times=10)
    # Unit: milliseconds
    #                               expr       min        lq    median       uq      max neval
    #  result <- df[which(df$age > 5), ]  77.01055  80.62678  81.43786 133.7753 145.4756    10
    #      result <- subset(df, age > 5) 190.89829 193.04221 197.49973 203.7571 263.7738    10
    #         result <- df[df$age > 5, ] 169.85649 171.02084 176.47480 185.9394 191.2803    10

    library(data.table)
    DT <- as.data.table(df)    # data.table
    microbenchmark(DT[age > 5],times=10)
    # Unit: milliseconds
    #         expr      min       lq  median       uq      max neval
    #  DT[age > 5] 29.49726 29.93907 30.1813 30.67168 32.81204    10
So, in this simple case, comparing median times, data.table is about 2.7 times faster than which(...), and about 6.5 times faster than subset(...).
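Before trusting the timings, it may be worth checking that the approaches actually return the same rows. The following is a minimal sketch of such a check, not part of the original benchmark; it uses a smaller sample (1e4 rows, an arbitrary choice) and the names r_which, r_subset and r_dt are only illustrative.

```r
library(data.table)

set.seed(1)
# Smaller sample than the benchmark above; only the results are compared here
df <- data.frame(age = sample(1:65, 1e4, replace = TRUE),
                 x   = rnorm(1e4),
                 y   = rpois(1e4, 25))
DT <- as.data.table(df)

r_which  <- df[which(df$age > 5), ]   # base R, integer index
r_subset <- subset(df, age > 5)       # base R, subset()
r_dt     <- DT[age > 5]               # data.table

# All three should keep the same rows, in the same order
nrow(r_which) == nrow(r_subset)
nrow(r_which) == nrow(r_dt)
all.equal(r_which$x, r_dt$x)
all.equal(r_subset$y, r_dt$y)
```

Note that this data has no missing values. If age contained NA, df[df$age > 5, ] would return extra all-NA rows, while which(...), subset(...) and DT[age > 5] drop those rows, so the row counts would no longer match across all four forms.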
jlhoward