Download and extract .gz data file using R

I have already tried to solve my problem by adapting the answers to this similar question. However, I get the following error for the URL I want to use:

    trying URL 'http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz'
    Content type 'application/x-gzip' length 65933953 bytes (62.9 Mb)
    opened URL
    downloaded 62.9 Mb

    Error in open.connection(file, "rt") : cannot open the connection
    In addition: Warning message:
    In open.connection(file, "rt") : cannot open zip file 'D:....'

Here is what I tried:

    url_S_C <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
    tmpFile <- tempfile()
    fileName <- gsub(".gz","",basename(url_S_C))
    download.file(url_S_C, tmpFile)
    data <- read.table(unz(tmpFile, fileName))
    unlink(tmpFile)

Can anyone explain why this particular file does not work for me? Note that the file is quite large (62.9 MB), but I could not reproduce the error with the URL from the similar question.

Thanks!

2 answers

Some additional options with base R:

    url <- "http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
    tmp <- tempfile()
    ##
    download.file(url,tmp)
    ##
    data <- read.csv(
      gzfile(tmp),
      sep="\t",
      header=TRUE,
      stringsAsFactors=FALSE)
    names(data)[1] <- sub("X\\.","",names(data)[1])
    ##
    R> head(data)
       mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id           mirna_alignment
    1 MIMAT0000062 hsa-let-7a    5270    SERPINE2    uc002vnu.2         NM_006216   uuGAUAUGUUGGAUGAU-GGAGu
    2 MIMAT0000062 hsa-let-7a  494188      FBXO47    uc002hrc.2      NM_001008777 uugaUA-UGUU--GGAUGAUGGAGu
    3 MIMAT0000062 hsa-let-7a   80025       PANK2    uc002wkc.2         NM_153638   uugauaUGUUGG-AUGAUGGAgu
    4 MIMAT0000062 hsa-let-7a   26036      ZNF451    uc003pdp.2          AK027074    uuGAUAUGUUGGAUGAUGGAGu
    5 MIMAT0000062 hsa-let-7a     586       BCAT1    uc001rgd.3         NM_005504    uugaUAUGUUGGAUGAUGGAGu
    6 MIMAT0000062 hsa-let-7a   22903       BTBD3    uc002wnz.2         NM_014962  uuGAUAUGUUGGAU-GAUGG-AGu
                  alignment            gene_alignment mirna_start mirna_end gene_start gene_end
    1  | :|: ||:|| ||| ||||   aaCGGUGAAAUCU-CUAGCCUCu           2        21        495      516
    2   || |||: ::||||||||: acaaAUCACAGUUUUUACUACCUUc           2        19        459      483
    3       |::||: ||||||||   aauuucAUGACUGUACUACCUga           3        17         77       99
    4     || || | | |||||||    ccCUCUAGA---UUCUACCUCa           2        21       1282     1300
    5       :|| |: ||||||||    guagGUAAAGGAAACUACCUCa           2        19       6410     6431
    6 || || ||| || ||||| ||  uaCUUUAAAACAUAUCUACCAUCu           2        21       2265     2288
                  genome_coordinates conservation align_score seed_cat energy mirsvr_score
    1 [hg19:2:224840068-224840089:-]       0.5684         122        0 -14.73      -0.7269
    2  [hg19:17:37092945-37092969:-]       0.6464         140        0 -16.38      -0.1156
    3    [hg19:20:3904018-3904040:+]       0.6522         139        0 -16.04      -0.2066
    4   [hg19:6:56966300-56966318:+]       0.7627         144        7 -14.51      -0.8609
    5  [hg19:12:24964511-24964532:-]       0.6775         150        7 -15.09      -0.2735
    6  [hg19:20:11906579-11906602:+]       0.5740         131        0 -12.59      -0.2540

Or, if you are using a Unix-like system, you can get the .txt file (outside of R or using system or system2 from R) as follows:

    [nathan@nrussell tmp]$ url="http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz"
    [nathan@nrussell tmp]$ wget "$url" && gunzip human_predictions_S_C_aug2010.txt.gz

and then proceed as above, reading human_predictions_S_C_aug2010.txt from wherever you ran wget and gunzip, which is ~/tmp in my case:

    data <- read.csv(
      "~/tmp/human_predictions_S_C_aug2010.txt",
      stringsAsFactors=FALSE,
      header=TRUE,
      sep="\t")
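As for why the attempt in the question fails: unz() expects a zip archive with named members, whereas a .gz file is a single gzip-compressed stream, which is what gzfile() reads. The following is a minimal sketch of that difference using a small, made-up sample file written locally, so it needs no download. The helper gunzip_to() is a hypothetical name introduced here as a base R stand-in for the command-line gunzip step; it is not part of base R.

```r
# Made-up, tab-separated sample data written to a local gzip file.
sample_lines <- c("mirbase_acc\tgene_id", "MIMAT0000062\t5270", "MIMAT0000062\t586")
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, "wt")
writeLines(sample_lines, con)
close(con)

# unz() treats the file as a zip archive, so opening it fails on gzip data.
zip_con <- unz(gz_path, "anything.txt")
unz_error <- tryCatch(
  {
    suppressWarnings(open(zip_con, "rt"))
    NULL
  },
  error = function(e) conditionMessage(e)
)
try(close(zip_con), silent = TRUE)

# gzfile() decompresses the stream on the fly while reading.
data <- read.delim(gzfile(gz_path), stringsAsFactors = FALSE)

# Hypothetical helper: decompress a .gz file to disk in chunks using base R only.
gunzip_to <- function(src, dest, chunk_size = 1024 * 1024) {
  input <- gzfile(src, "rb")
  on.exit(close(input), add = TRUE)
  output <- file(dest, "wb")
  on.exit(close(output), add = TRUE)
  repeat {
    bytes <- readBin(input, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    writeBin(bytes, output)
  }
  invisible(dest)
}

txt_path <- gunzip_to(gz_path, tempfile(fileext = ".txt"))
```

Here unz_error holds the error message from the failed open, data is a two-row data frame, and txt_path points at the decompressed text file.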


You can read data from a file in R as follows (tested on Windows):

    library(stringr)
    library(plyr)
    library(dplyr)

    # download and extract file from web
    temp <- tempfile()
    download.file("http://cbio.mskcc.org/microrna_data/human_predictions_S_C_aug2010.txt.gz", temp)
    # read through a gzfile() connection, which read.csv opens and closes itself
    data <- read.csv(gzfile(temp), stringsAsFactors = FALSE, nrows = 20)
    unlink(temp)

    # column names
    my_names <- str_split(names(data), "\\.") %>% unlist(.)

    # toy example using only first 6 rows of dataset
    mickey_mouse_data <- head(data) %>% unlist(.) %>% str_split(., "\t") %>% ldply(.)
    names(mickey_mouse_data) <- my_names[-1]
    tbl_df(mickey_mouse_data)

       mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id
    1 MIMAT0000062 hsa-let-7a    5270    SERPINE2    uc002vnu.2         NM_006216
    2 MIMAT0000062 hsa-let-7a  494188      FBXO47    uc002hrc.2      NM_001008777
    3 MIMAT0000062 hsa-let-7a   80025       PANK2    uc002wkc.2         NM_153638
    4 MIMAT0000062 hsa-let-7a   26036      ZNF451    uc003pdp.2          AK027074
    5 MIMAT0000062 hsa-let-7a     586       BCAT1    uc001rgd.3         NM_005504
    6 MIMAT0000062 hsa-let-7a   22903       BTBD3    uc002wnz.2         NM_014962
    Variables not shown: mirna_alignment (chr), alignment (chr), gene_alignment (chr),
      mirna_start (chr), mirna_end (chr), gene_start (chr), gene_end (chr),
      genome_coordinates (chr), conservation (chr), align_score (chr), seed_cat (chr),
      energy (chr), mirsvr_score (chr)
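Note that every column in the output above is of type chr, because the rows are split by hand after being read as single strings. As a rough sketch of an alternative, passing the tab separator to the reader lets R infer the column types while still limiting the read with nrows. The sample file below is made up for illustration (a few columns from the output above, without the leading "#" that the real header may carry), so it runs without the download.

```r
# Made-up sample: a small gzip-compressed, tab-separated file.
gz_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_path, "wt")
writeLines(c(
  "mirbase_acc\tmirna_name\tgene_id\tgene_symbol",
  "MIMAT0000062\thsa-let-7a\t5270\tSERPINE2",
  "MIMAT0000062\thsa-let-7a\t494188\tFBXO47"
), con)
close(con)

# read.delim() defaults to sep = "\t" and header = TRUE;
# nrows caps how many data rows are read from the (possibly large) file.
first_rows <- read.delim(gzfile(gz_path), nrows = 20, stringsAsFactors = FALSE)

# Column types are inferred per column instead of all being character.
column_types <- sapply(first_rows, class)
unlink(gz_path)
```

With this sample, gene_id comes back as an integer column and the remaining columns as character, with no string splitting needed.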