I wrote a previous (similar) post here, where I tried to create a wide table rather than a long table. I have since realized that a long format is better, so I am posting this as a new question, along with what I have tried.
I use R to read about ~11,000 files:

    # get list of ~11000 files
    # note: `pattern` is a regular expression, so "\\.tsv$" matches the
    # extension reliably (the glob-style "*.tsv" does not work as intended)
    lfiles <- list.files(pattern = "\\.tsv$", full.names = TRUE)
I use R because the rest of my workflow (creating graphs, etc.) is in R. I have two questions:
1. Is there a way to make my code more efficient/faster? I know the number of rows expected at the end, so would preallocating the data frame help?
2. How should I save this huge table, and in what format: .RData, a database, or something else?
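A minimal sketch of one common approach to both questions, assuming the data.table package is installed (the file name `all_samples_long.rds` is made up for illustration): fread is much faster than read.table, and rbindlist handles allocation internally, so no manual preallocation is needed.

```r
library(data.table)

# list the ~11000 files (pattern is a regex, so escape the dot)
lfiles <- list.files(pattern = "\\.tsv$", full.names = TRUE)

# fread each file and stack all of them into one long table;
# idcol records which file each row came from
long_dt <- rbindlist(
  setNames(lapply(lfiles, fread), basename(lfiles)),
  idcol = "file"
)

# save a single compressed R object; readRDS() restores it later
saveRDS(long_dt, "all_samples_long.rds", compress = "xz")
```

An .rds file stores one object (unlike .RData, which stores a whole workspace) and round-trips the table exactly; for data this size, a binary format like this is usually both smaller and faster to reload than re-parsing the TSVs.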
As additional information: I have three types of files for which I want to do this. They look like this:
    [centos@ip data]$ head C021_0011_001786_tumor_RNASeq.abundance.tsv
    target_id           length  eff_length  est_counts  tpm
    ENST00000619216.1   68      26.6432     10.9074     5.69241
    ENST00000473358.1   712     525.473     0           0
    ENST00000469289.1   535     348.721     0           0
    ENST00000607096.1   138     15.8599     0           0
    ENST00000417324.1   1187    1000.44     0.0673096   0.000935515
    ENST00000461467.1   590     403.565     3.22654     0.11117
    ENST00000335137.3   918     731.448     0           0
    ENST00000466430.5   2748    2561.44     162.535     0.882322
    ENST00000495576.1   1319    1132.44     0           0

    [centos@ip data]$ head C021_0011_001786_tumor_RNASeq.rsem.genes.norm_counts.hugo.tab
    gene_id   C021_0011_001786_tumor_RNASeq
    TSPAN6    1979.7185
    TNMD      1.321
    DPM1      1878.8831
    SCYL3     452.0372
    C1orf112  203.6125
    FGR       494.049
    CFH       509.8964
    FUCA2     1821.6096
    GCLC      1557.4431

    [centos@ip data]$ head CPBT_0009_1_tumor_RNASeq.rsem.genes.norm_counts.tab
    gene_id             CPBT_0009_1_tumor_RNASeq
    ENSG00000000003.14  2005.0934
    ENSG00000000005.5   5.0934
    ENSG00000000419.12  1100.1698
    ENSG00000000457.13  2376.9100
    ENSG00000000460.16  1536.5025
    ENSG00000000938.12  443.1239
    ENSG00000000971.15  1186.5365
    ENSG00000001036.13  1091.6808
    ENSG00000001084.10  1602.7165
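Because the norm_counts files carry the sample name as the header of their second column (as in the heads above), one way to get them into long format is to melt each file down to gene/sample/value rows before stacking. A sketch, assuming data.table; the helper name `read_counts_long` and the column name `norm_count` are made up for illustration:

```r
library(data.table)

# read one norm_counts file and melt it to long format:
# one row per (gene_id, sample) pair
read_counts_long <- function(path) {
  dt <- fread(path)
  melt(dt, id.vars = "gene_id",
       variable.name = "sample", value.name = "norm_count")
}

# apply to all files of one type and stack into a single long table
files <- list.files(pattern = "norm_counts.*\\.tab$", full.names = TRUE)
counts_long <- rbindlist(lapply(files, read_counts_long))
```

The abundance.tsv files have a different layout (five fixed columns, sample encoded in the file name), so they would need their own reader that adds a sample column taken from `basename(path)` instead of melting.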
Any help would be greatly appreciated!
Thanks!