A quick way to select rows in a table in R?

I am looking for a quick way to extract a large number of rows from an even larger table. The top of my table is as follows:

    > head(dbsnp)
          snp      gene distance
    rs5   rs5     KRIT1        1
    rs6   rs6   CYP51A1        1
    rs7   rs7 LOC401387        1
    rs8   rs8      CDK6        1
    rs9   rs9      CDK6        1
    rs10 rs10      CDK6        1

And its dimensions:

    > dim(dbsnp)
    [1] 11934948        3

I want to select the rows whose row names are contained in this vector:

    > head(features)
    [1] "rs1367830" "rs5915027" "rs2060113" "rs1594503" "rs1116848" "rs1835693"
    > length(features)
    [1] 915635

Unsurprisingly, the straightforward way to do this, temptable = dbsnp[features,], takes quite a long time.

I was looking for a way to do this through the sqldf package in R, thinking it might be faster. Unfortunately, I cannot figure out how to select rows with specific row names in SQL.

Thanks.

+7
3 answers

This is how most people would first try to do it:

 dbsnp[ rownames(dbsnp) %in% features, ] # which is probably slower than your code 

Since you say it takes a long time, I suspect that you have exceeded your memory capacity and started using virtual memory. You should shut down the system, restart it with R as the only running application, and see whether you can avoid going into virtual memory.
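To make the difference between the two base R forms concrete, here is a minimal sketch on toy data. The small table, the deliberately missing name rs999, and the variable names by_mask, idx and by_match are assumptions made for illustration; match() is swapped in as an alternative that is not part of the answer above.

```r
# Toy stand-in for dbsnp; the values are illustrative only.
dbsnp <- data.frame(
  snp = c("rs5", "rs6", "rs7", "rs8"),
  gene = c("KRIT1", "CYP51A1", "LOC401387", "CDK6"),
  distance = 1L,
  row.names = c("rs5", "rs6", "rs7", "rs8"),
  stringsAsFactors = FALSE
)
features <- c("rs7", "rs5", "rs999")  # rs999 is deliberately absent

# Logical mask: rows come back in table order; absent names are ignored.
by_mask <- dbsnp[rownames(dbsnp) %in% features, ]

# match(): rows come back in the order of `features`.
# Names that are not in the table give NA, so drop those positions.
idx <- match(features, rownames(dbsnp))
by_match <- dbsnp[idx[!is.na(idx)], ]

rownames(by_mask)
rownames(by_match)
```

Both forms index by logical or integer position rather than by character row name, which is usually cheaper on a large data frame, though no timing is claimed here. Which one fits depends on whether the result should follow the order of the table or the order of features.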

+4

A data.table solution:

    library(data.table)

    dbsnp <- structure(list(snp = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"),
                            gene = c("KRIT1", "CYP51A1", "LOC401387", "CDK6", "CDK6", "CDK6"),
                            distance = c(1L, 1L, 1L, 1L, 1L, 1L)),
                       .Names = c("snp", "gene", "distance"),
                       class = "data.frame",
                       row.names = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"))

    DT <- data.table(dbsnp, key='snp')
    features <- c('rs5', 'rs7', 'rs9')
    DT[features]
    #    snp      gene distance
    # 1: rs5     KRIT1        1
    # 2: rs7 LOC401387        1
    # 3: rs9      CDK6        1
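With 915635 names to look up, some of them may not occur in the table at all. As a hedged sketch of how a keyed join treats such names, the example below uses a toy table and a made-up missing name rs999 (both assumptions for illustration). By default an unmatched name yields a row of NA values; passing nomatch = NULL (available in data.table 1.12.0 and later) drops it instead.

```r
library(data.table)

# Toy keyed table; the values are illustrative only.
DT <- data.table(
  snp = c("rs5", "rs6", "rs7", "rs8"),
  gene = c("KRIT1", "CYP51A1", "LOC401387", "CDK6"),
  distance = 1L,
  key = "snp"
)
features <- c("rs5", "rs7", "rs999")  # rs999 is deliberately absent

# Default keyed join: one row per looked-up name, NA where nothing matches.
all_rows <- DT[.(features)]

# Inner join: names without a match are dropped.
matched <- DT[.(features), nomatch = NULL]

all_rows
matched
```

The .() wrapper makes it explicit that features is joined against the key column rather than interpreted as row numbers.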
+10

With sqldf you need row.names = TRUE; then you can query the row names through the row_names column:

    library(sqldf)

    ## input
    test <- read.table(header = TRUE, text = "
          snp      gene distance
    rs5   rs5     KRIT1        1
    rs6   rs6   CYP51A1        1
    rs7   rs7 LOC401387        1
    rs8   rs8      CDK6        1
    rs9   rs9      CDK6        1
    rs10 rs10      CDK6        1
    ")
    features <- c("rs5", "rs7", "rs10")

    ## calculate
    inVar <- toString(shQuote(features, type = "csh"))  # 'rs5', 'rs7', 'rs10'
    fn$sqldf("SELECT * FROM test t WHERE t.row_names IN ($inVar)", row.names = TRUE)

    ## result
    #       snp      gene distance
    # rs5   rs5     KRIT1        1
    # rs7   rs7 LOC401387        1
    # rs10 rs10      CDK6        1

UPDATE: alternatively, if fet is a data frame whose features column contains the names to search for:

    fet <- data.frame(features)
    sqldf("SELECT t.* FROM test t WHERE t.row_names IN (SELECT features FROM fet)",
          row.names = TRUE)

Also, if the data were large enough, we could speed this up with indexes. For this and other details, see the sqldf homepage.

+5
