Using the split function to group data by factor, alternatives for large data frames

I have a question about using the split function to group data by a factor.

I have a data frame with two columns, snps and gene. snps is a factor and gene is a character vector. I want to group the genes by snp so that I can see the list of genes that map to each snp. Some snps map to more than one gene (for example, rs10000226 maps to genes 345274 and 5783), and the same gene can occur several times.

To do this, I used the split function to build a list holding the genes attached to each snp.

    snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
    gene <- c("5783", "171425", "345274", "5783")
    df <- data.frame(snps, gene, stringsAsFactors = TRUE)  # snps is a factor
    df$gene <- as.character(df$gene)
    splitted <- split(df, df$snps, drop = TRUE)  # group by snp; the list is named by snp
    df.2 <- lapply(splitted, function(x) { x["snps"] <- NULL; x })  # remove the snp column
    df.2 <- sapply(df.2, function(x) list(as.character(x$gene)))  # keep only the genes
    save(df.2, file = "df.2.rda")

However, this does not scale to my full data frame (probably because of its size: 363422 rows, 281370 unique snps, 20888 unique genes), and R crashes when I try to load df.2.rda later.

Any suggestions on alternative ways to do this would be much appreciated!

1 answer

There is a shorter way to create df.2:

    genes_by_snp <- split(df$gene, df$snps)

You can look up the genes for a given snp with genes_by_snp[["rs10000226"]].
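As a minimal, self-contained sketch, here is the same call on the four-row example from the question. The `lapply(..., unique)` step is my own addition, on the assumption that repeated snp-gene pairs should be collapsed; skip it if duplicates matter.

```r
snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
df <- data.frame(snps, gene, stringsAsFactors = FALSE)

# One character vector of genes per snp; list elements are named by snp
genes_by_snp <- split(df$gene, df$snps)

# Optional (assumption): drop duplicate genes within each snp
genes_by_snp <- lapply(genes_by_snp, unique)

# Genes mapped to a single snp
genes_by_snp[["rs10000226"]]
```

Splitting the gene vector directly avoids building one small data frame per snp, which is likely where most of the overhead in the original approach comes from.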


Your dataset doesn't sound that large to me, but you could avoid creating the list above altogether by storing the original data differently. Expanding on @AnandoMahto's comment, here's how to do it with the data.table package:

    require(data.table)
    setDT(df)          # convert df to a data.table by reference
    setkey(df, snps)   # sort by snps and mark it as the key

You can look up the genes for a given snp with df[J("rs10000226")].
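A rough end-to-end sketch of the keyed approach, assuming the data.table package is installed. The names `dt` and `genes_per_snp` are made up for illustration, and the final aggregation is an optional extra rather than something the lookup requires.

```r
library(data.table)

snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
dt <- data.table(snps, gene)
setkey(dt, snps)                 # sort by snps and mark it as the key

# Keyed lookup: all rows for one snp
dt[J("rs10000226")]

# Just the genes for that snp, as a character vector
dt[J("rs10000226"), gene]

# Optional: one row per snp, with its unique genes in a list column
genes_per_snp <- dt[, .(genes = list(unique(gene))), by = snps]
```

The data stays in a single two-column table, so there is no large nested list to save and reload; the table itself can be written out and the lookups done on demand.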
