R: Row-bind a very large number of files quickly

I wrote a previous (similar) post here, where I tried to create a wide table rather than a long table. I have since realized that it is best to have the table in long format, so I am posting this as a separate question, along with what I tried.

I use R to read ~11,000 files using:

    # get list of ~11,000 files
    lfiles <- list.files(pattern = "*.tsv", full.names = TRUE)

    # row-bind the files:
    # - rbindlist() to bind, fread() to read each file
    # - mclapply() with 32 cores assigned
    # - add the file basename as an id column to identify rows
    dat <- rbindlist(mclapply(lfiles, function(X) {
      data.frame(id = basename(tools::file_path_sans_ext(X)), fread(X))
    }, mc.cores = 32))

I use R because my flow processing, like creating graphs, etc., is in R. I have two questions:

1. Is there a way to make my code more efficient / faster? I know the number of rows expected at the end - would preallocating the data frame help?

2. In what format should I save this huge data set - .RData, a database, or something else?

As additional information: I have three types of files for which I want to do this. They look like this:

    [centos@ip data]$ head C021_0011_001786_tumor_RNASeq.abundance.tsv
    target_id          length  eff_length  est_counts  tpm
    ENST00000619216.1  68      26.6432     10.9074     5.69241
    ENST00000473358.1  712     525.473     0           0
    ENST00000469289.1  535     348.721     0           0
    ENST00000607096.1  138     15.8599     0           0
    ENST00000417324.1  1187    1000.44     0.0673096   0.000935515
    ENST00000461467.1  590     403.565     3.22654     0.11117
    ENST00000335137.3  918     731.448     0           0
    ENST00000466430.5  2748    2561.44     162.535     0.882322
    ENST00000495576.1  1319    1132.44     0           0

    [centos@ip data]$ head C021_0011_001786_tumor_RNASeq.rsem.genes.norm_counts.hugo.tab
    gene_id   C021_0011_001786_tumor_RNASeq
    TSPAN6    1979.7185
    TNMD      1.321
    DPM1      1878.8831
    SCYL3     452.0372
    C1orf112  203.6125
    FGR       494.049
    CFH       509.8964
    FUCA2     1821.6096
    GCLC      1557.4431

    [centos@ip data]$ head CPBT_0009_1_tumor_RNASeq.rsem.genes.norm_counts.tab
    gene_id             CPBT_0009_1_tumor_RNASeq
    ENSG00000000003.14  2005.0934
    ENSG00000000005.5   5.0934
    ENSG00000000419.12  1100.1698
    ENSG00000000457.13  2376.9100
    ENSG00000000460.16  1536.5025
    ENSG00000000938.12  443.1239
    ENSG00000000971.15  1186.5365
    ENSG00000001036.13  1091.6808
    ENSG00000001084.10  1602.7165

Any help would be greatly appreciated!

Thanks!

2 answers

You cannot do this much faster than with fread and rbindlist in R. But you should not call data.frame, which copies the data. Instead, add the id column by reference:

    DF <- fread(X)
    DF[, id := basename(tools::file_path_sans_ext(X))]
    return(DF)
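Put together, the whole loop might look like this (a sketch assuming the same file list and core count as in the question; `read_one` is an illustrative name):

```r
library(data.table)
library(parallel)

lfiles <- list.files(pattern = ".+\\.tsv$", full.names = TRUE)

read_one <- function(X) {
  DF <- fread(X)
  # := adds the id column by reference - no copy of the table is made
  DF[, id := basename(tools::file_path_sans_ext(X))]
  DF
}

dat <- rbindlist(mclapply(lfiles, read_one, mc.cores = 32))
```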

However, you should consider using a database.
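For example, with DBI and RSQLite you can write the combined long table once and then query only the rows you need instead of reloading everything (a sketch; the small `dat`, the table name `abundance`, and the file name `expression.sqlite` are illustrative):

```r
library(data.table)
library(DBI)
library(RSQLite)

# 'dat' stands in for the combined long table produced by rbindlist()
dat <- data.table(id = c("sampleA", "sampleB"),
                  target_id = c("ENST00000619216.1", "ENST00000473358.1"),
                  est_counts = c(10.9074, 0))

con <- dbConnect(SQLite(), "expression.sqlite")
dbWriteTable(con, "abundance", dat, overwrite = TRUE)

# later: fetch a single sample without loading the whole table into memory
res <- dbGetQuery(con, "SELECT * FROM abundance WHERE id = 'sampleA'")
dbDisconnect(con)
```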

PS: The correct regular expression is ".+\\.tsv$". It matches any file name with one or more characters, followed by a literal period, the string "tsv", and the end of the file name. (The pattern argument of list.files takes a regular expression, not a shell glob like "*.tsv".)
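A quick check of what that regex matches:

```r
files <- c("a.tsv", "a.tsv.bak", "btsv")
grepl(".+\\.tsv$", files)
# TRUE FALSE FALSE
```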


As for question 1., I can't say for sure whether there will be a noticeable difference, but you can try the following to avoid calling data.frame (as @Roland mentioned in his answer):

    lfiles <- list.files(pattern = ".*\\.tsv$", full.names = TRUE)
    setattr(lfiles, "names", basename(lfiles))
    dat <- rbindlist(mclapply(lfiles, fread, mc.cores = 32), idcol = "id")

Here the idcol argument of rbindlist fills the id column from the names set on lfiles.

Regarding question 2. , I think it depends on what you want to do later in your analysis.
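For instance, if the table mostly gets reloaded into R, a serialized R object is compact and fast to read back; if other tools also need it, a compressed flat file written with fwrite is more portable (a sketch; the small `dat` and the file names are illustrative):

```r
library(data.table)

# 'dat' stands in for the combined long table
dat <- data.table(id = "sampleA",
                  target_id = "ENST00000619216.1",
                  est_counts = 10.9074)

# fast round-trip back into R
saveRDS(dat, "dat.rds")
dat2 <- readRDS("dat.rds")

# portable, compressed text for other tools
# (fwrite gzips automatically when the file name ends in .gz)
fwrite(dat, "dat.tsv.gz", sep = "\t")
```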

