How to check if a file is compressed in R

What is the best way in R to determine whether a file is compressed? Is there a specific function to test this? I am looking for something other than checking the file name extension, as in this example:

 grepl("\\.(gz|bz2|tar|zip|tgz|gzip|7z)[[:space:]]*$", filename)
3 answers

If you are using Linux (or a similar system), you can use the file command. For instance:

 file filename 

This reports useful information about a number of formats, including whether a file is compressed with gzip (one of the formats that R can read directly).
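
From within R, the same check could be scripted with system2(). The following is a minimal sketch, assuming a Unix-like system where the file utility is on the PATH and accepts the --brief and --mime-type options; mime_type is a hypothetical helper name, not an existing R function.

```r
# Sketch: ask the system `file` utility for a MIME type from within R.
# Assumes `file` is installed and on the PATH (Linux, macOS, or similar).
# `mime_type` is a hypothetical helper name.
mime_type <- function(path) {
  stopifnot(file.exists(path))
  out <- system2("file",
                 c("--brief", "--mime-type", shQuote(path)),
                 stdout = TRUE)
  trimws(out[1])
}

# Example: write a small gzip-compressed file, then inspect it.
gz_path <- tempfile(fileext = ".dat")
con <- gzfile(gz_path, "w")
writeLines("hi there", con)
close(con)

if (nzchar(Sys.which("file"))) {
  print(mime_type(gz_path))
}
```

The exact MIME string differs between versions of file (for example application/gzip versus application/x-gzip), so matching on a substring such as "gzip" is safer than testing for equality.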


In R, do the following:

 filetype = summary( file('yourfile.gz') )$class 

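If the file is gzip-compressed, filetype will be "gzfile" (bzip2 and xz files give "bzfile" and "xzfile"); for an uncompressed file it will be "file".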


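Note: the call above leaves a connection open. You can instead assign the connection to a variable and close it afterwards:

 filetype <- function(path) {
   f <- file(path)              # create the connection
   ext <- summary(f)$class      # "gzfile", "bzfile", "xzfile" or "file"
   close(f)                     # release the connection
   ext
 }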
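
A related approach that does not depend on connection classes is to read the first few bytes of the file and compare them with well-known format signatures. The following is a sketch only: detect_compression and its signature table are illustrative names, the list of formats is not exhaustive, and an uncompressed .tar archive has no leading signature, so it is reported as NA here.

```r
# Sketch: identify common compressed/archive formats from leading bytes.
# `detect_compression` and `signatures` are illustrative names.
detect_compression <- function(path) {
  signatures <- list(
    gzip  = as.raw(c(0x1f, 0x8b)),
    bzip2 = as.raw(c(0x42, 0x5a, 0x68)),                    # "BZh"
    xz    = as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)),
    zip   = as.raw(c(0x50, 0x4b, 0x03, 0x04)),              # "PK" 03 04
    `7z`  = as.raw(c(0x37, 0x7a, 0xbc, 0xaf, 0x27, 0x1c))
  )
  # Given a file name, readBin() reads the raw bytes without decompressing.
  header <- readBin(path, what = "raw", n = 6L)
  for (name in names(signatures)) {
    sig <- signatures[[name]]
    if (length(header) >= length(sig) &&
        identical(header[seq_along(sig)], sig)) {
      return(name)
    }
  }
  NA_character_  # no known signature matched
}

# Example: a real gzip file, and a text file with a misleading extension.
gz_path <- tempfile(fileext = ".dat")
con <- gzfile(gz_path, "w")
writeLines("hi there", con)
close(con)

txt_path <- tempfile(fileext = ".zip")
writeLines("hi there", txt_path)

print(detect_compression(gz_path))
print(detect_compression(txt_path))
```

Because the check looks at content rather than the name, the text file ending in .zip is not mistaken for an archive.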

If you have Java installed, you can use the free Apache Tika tool to check file metadata.

Setup after download:

 alias tika='java -jar /opt/java_shared/tika/tika-app-1.7.jar' 

Analyze a file (slow, takes about 5 seconds):

 tika -m chroma-1.15.tar.bz2
 Content-Length: 2690725
 Content-Type: application/x-bzip2
 X-Parsed-By: org.apache.tika.parser.DefaultParser
 X-Parsed-By: org.apache.tika.parser.pkg.CompressorParser
 resourceName: chroma-1.15.tar.bz2

Another example:

 echo "hi there" > notazipfile.zip
 tika -m notazipfile.zip
 Content-Encoding: ISO-8859-1
 Content-Length: 9
 Content-Type: text/plain; charset=ISO-8859-1
 X-Parsed-By: org.apache.tika.parser.DefaultParser
 X-Parsed-By: org.apache.tika.parser.txt.TXTParser
 resourceName: notazipfile.zip
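
If the Tika output is captured in R (for instance with system2() and stdout = TRUE, which is an assumption about how one might call it), the "Key: value" lines can be split into a named vector. The following is a sketch: parse_metadata is a hypothetical helper, and the input lines are copied from the example output above rather than produced by running Tika.

```r
# Sketch: turn "Key: value" metadata lines into a named character vector.
# `parse_metadata` is a hypothetical helper name.
parse_metadata <- function(lines) {
  lines <- lines[grepl(": ", lines, fixed = TRUE)]  # keep "Key: value" lines
  keys <- sub(": .*$", "", lines)                   # text before first ": "
  values <- sub("^[^:]*: ", "", lines)              # text after first ": "
  stats::setNames(values, keys)
}

# Lines copied from the notazipfile.zip example, not generated here.
tika_lines <- c(
  "Content-Encoding: ISO-8859-1",
  "Content-Length: 9",
  "Content-Type: text/plain; charset=ISO-8859-1",
  "X-Parsed-By: org.apache.tika.parser.DefaultParser",
  "X-Parsed-By: org.apache.tika.parser.txt.TXTParser",
  "resourceName: notazipfile.zip"
)

meta <- parse_metadata(tika_lines)
content_type <- meta[["Content-Type"]]
print(content_type)
```

The Content-Type field can then be compared against the types of interest, such as application/x-bzip2 from the first example.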

There is a help page:

 tika --help 

The list of supported types is long, so it helps to filter it:

 tika --list-supported-types | grep -C 3 bzip2
 application/x-bzip
 supertype: application/octet-stream
 parser: org.apache.tika.parser.pkg.CompressorParser

Again, checking large files may take some time.

Note that someone started work on an R interface to Tika, but the project page appears to have been inactive since 2012: https://r-forge.r-project.org/projects/r-tika/

