Fast concatenation of thousands of files in columns

I use R to bind around 11,000 files using:

 dat <- do.call('bind_cols',lapply(lfiles,read.delim)) 

which is incredibly slow. I use R because my downstream processing, such as creating graphs, is in R. What are some faster alternatives for concatenating thousands of files column-wise?

I have three types of files for which I want to do this. They look like this:

 [centos@ip data]$ head C021_0011_001786_tumor_RNASeq.abundance.tsv
 target_id          length  eff_length  est_counts  tpm
 ENST00000619216.1  68      26.6432     10.9074     5.69241
 ENST00000473358.1  712     525.473     0           0
 ENST00000469289.1  535     348.721     0           0
 ENST00000607096.1  138     15.8599     0           0
 ENST00000417324.1  1187    1000.44     0.0673096   0.000935515
 ENST00000461467.1  590     403.565     3.22654     0.11117
 ENST00000335137.3  918     731.448     0           0
 ENST00000466430.5  2748    2561.44     162.535     0.882322
 ENST00000495576.1  1319    1132.44     0           0

 [centos@ip data]$ head C021_0011_001786_tumor_RNASeq.rsem.genes.norm_counts.hugo.tab
 gene_id   C021_0011_001786_tumor_RNASeq
 TSPAN6    1979.7185
 TNMD      1.321
 DPM1      1878.8831
 SCYL3     452.0372
 C1orf112  203.6125
 FGR       494.049
 CFH       509.8964
 FUCA2     1821.6096
 GCLC      1557.4431

 [centos@ip data]$ head CPBT_0009_1_tumor_RNASeq.rsem.genes.norm_counts.tab
 gene_id             CPBT_0009_1_tumor_RNASeq
 ENSG00000000003.14  2005.0934
 ENSG00000000005.5   5.0934
 ENSG00000000419.12  1100.1698
 ENSG00000000457.13  2376.9100
 ENSG00000000460.16  1536.5025
 ENSG00000000938.12  443.1239
 ENSG00000000971.15  1186.5365
 ENSG00000001036.13  1091.6808
 ENSG00000001084.10  1602.7165

Thanks!

2 answers

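To read the files quickly, we can use fread from data.table and then combine the resulting list of data.tables with rbindlist, specifying idcol=TRUE to add a grouping column that identifies which file each row came from. Note that this stacks the files by row (long format) rather than binding them side by side.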

 library(data.table)
 DT <- rbindlist(lapply(lfiles, fread), idcol=TRUE)
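Since the question asks for one column per file, the long table can be reshaped afterwards. The following is only a sketch under stated assumptions: the demo files, the `read_one` helper, and the common column name `value` are hypothetical stand-ins, and each file is assumed to have two columns (an ID column followed by a sample-named value column) like the `norm_counts` files in the question.

```r
library(data.table)

# Hypothetical demo files standing in for the real *.norm_counts.tab files.
tmp <- tempfile("demo_")
dir.create(tmp)
genes <- c("TSPAN6", "TNMD", "DPM1")
for (i in 1:2) {
  s <- paste0("sample", i)
  d <- data.table(gene_id = genes, value = i * c(10, 20, 30))
  setnames(d, "value", s)  # value column is named after the sample, as in the question
  fwrite(d, file.path(tmp, paste0(s, ".tab")), sep = "\t")
}
lfiles <- list.files(tmp, pattern = "\\.tab$", full.names = TRUE)
names(lfiles) <- basename(lfiles)  # list names become the idcol values

# Helper (hypothetical): read one file and give its columns common names
# so that every file lines up when the rows are stacked.
read_one <- function(f) {
  d <- fread(f)
  setnames(d, c("gene_id", "value"))
  d
}

# Long format, as in the answer: one row per (file, gene).
long <- rbindlist(lapply(lfiles, read_one), idcol = "file")

# Wide format: one row per gene, one column per file.
wide <- dcast(long, gene_id ~ file, value.var = "value")
```

Reshaping by `gene_id` does not rely on the files listing their rows in the same order, at the cost of holding the long table in memory first.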

If all of your data are numeric, you can first convert them to matrices; binding matrices can be significantly faster than binding data frames:

 > microbenchmark(
 +   do.call(cbind, rep(list(sleep), 1000)),
 +   do.call(cbind, rep(list(as.matrix(sleep)), 1000))
 + )
 Unit: microseconds
                                               expr      min       lq       mean   median        uq       max neval
            do.call(cbind, rep(list(sleep), 1000)) 6978.635 7496.690 8038.21531 7864.180 8397.8595 12213.473   100
 do.call(cbind, rep(list(as.matrix(sleep)), 1000))  636.282  722.814  862.01125  744.647  793.0695  7416.430   100

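Alternatively, if you need a data frame, you can cheat by flattening the list of data frames with unlist and then manually setting the row names and the class:

 df <- unlist(rep(list(sleep), 1000), recursive=FALSE)
 # a data frame also needs row names; without them it reports 0 rows
 attr(df, 'row.names') <- .set_row_names(nrow(sleep))
 class(df) <- 'data.frame'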
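Applied to the files in the question, the matrix approach might look like the sketch below. It assumes, which the question does not state, that every file lists the same IDs in the same order and has exactly two columns (an ID column and one numeric column). The demo files and the `read_values` helper are hypothetical.

```r
# Hypothetical demo files standing in for the real *.norm_counts.tab files.
tmp <- tempfile("demo_")
dir.create(tmp)
genes <- c("TSPAN6", "TNMD", "DPM1")
for (i in 1:3) {
  s <- paste0("sample", i)
  d <- data.frame(gene_id = genes, x = i * c(1.5, 2.5, 3.5))
  names(d)[2] <- s  # value column is named after the sample
  write.table(d, file.path(tmp, paste0(s, ".tab")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}
lfiles <- list.files(tmp, pattern = "\\.tab$", full.names = TRUE)

# Take the IDs once, from the first file (colClasses "NULL" skips a column).
ids <- read.delim(lfiles[1], colClasses = c("character", "NULL"))[[1]]

# Helper (hypothetical): read only the numeric column of one file.
read_values <- function(f) {
  read.delim(f, colClasses = c("NULL", "numeric"))[[1]]
}

# vapply builds the numeric matrix directly, one column per file, and
# fails loudly if a file does not have the expected number of rows.
mat <- vapply(lfiles, read_values, numeric(length(ids)))
dimnames(mat) <- list(ids, sub("\\.tab$", "", basename(lfiles)))

# Convert once at the end if a data frame is needed downstream.
dat <- as.data.frame(mat)
```

Skipping the ID column in all but the first file avoids parsing the same strings thousands of times, but it also means mismatched row orders would go unnoticed, so it may be worth spot-checking a few files first.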