Reading in multiple tar.gz compressed files in Spark

I am trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files:

file1.json file2.json file3.json 

They are contained in archive.tar.gz.

I want to create a DataFrame from the JSON files. The problem is that Spark does not read the JSON files correctly: creating an RDD with sqlContext.read.json("archive.tar.gz") or sc.textFile("archive.tar.gz") produces distorted and extra output.
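For concreteness, these are the reads that produce the corrupted output; both decompress the gzip layer (chosen by the .gz extension) but treat the tar stream inside as plain text, so tar headers end up mixed into the data:

    // Neither call understands the tar layout inside the gzip stream:
    val rdd = sc.textFile("archive.tar.gz")          // lines contain tar header bytes
    val df  = sqlContext.read.json("archive.tar.gz") // parses those corrupt lines as JSON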

Is there a way to handle gzipped archives containing multiple files in Spark?

UPDATE

Using the method given in the answer Read entire text files from compression in Spark I managed to get everything running, but this method does not seem suitable for large tar.gz archives (> 200 MB compressed), because the application chokes on large archive sizes. Since some of the archives I deal with reach 2 GB after compression, I wonder if there is an efficient way to handle this.

I am trying to avoid extracting the archives and then merging the files, as this would be time consuming.

scala gzip apache-spark rdd

2 answers

The solution is given in Read entire text files from compression in Spark. Using the provided code sample, I was able to create a DataFrame from a compressed archive as follows:

    val jsonRDD = sc.binaryFiles("gzarchive/*")
      .flatMapValues(x => extractFiles(x).toOption)
      .mapValues(_.map(decode()))

    val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
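For reference, here is a sketch of what the extractFiles and decode helpers look like, following the Apache Commons Compress approach from the linked answer; treat it as an outline of that code, not an authoritative copy:

    import java.nio.charset.StandardCharsets
    import scala.util.Try
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
    import org.apache.spark.input.PortableDataStream

    // Unpack every regular file inside one .tar.gz into a raw byte array.
    def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Array[Array[Byte]]] = Try {
      val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
      Stream.continually(Option(tar.getNextTarEntry))
        .takeWhile(_.isDefined)        // getNextTarEntry returns null at end of archive
        .flatten
        .filter(!_.isDirectory)        // skip directory entries
        .map { _ =>
          Stream.continually {
            val buffer = Array.fill[Byte](n)(-1)
            val read = tar.read(buffer, 0, n)    // bytes read, -1 at end of entry
            (read, buffer.take(read.max(0)))
          }
          .takeWhile(_._1 > 0)         // stop at the end of the current entry
          .flatMap(_._2)
          .toArray                     // force evaluation so entries are read in order
        }
        .toArray
    }

    // Turn the raw bytes of one archive member into JSON text.
    def decode(charset: String = StandardCharsets.UTF_8.name)(bytes: Array[Byte]): String =
      new String(bytes, charset)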

This method works fine for tar archives of relatively small size, but is not suitable for large ones, since sc.binaryFiles hands each archive to a single task as one unsplittable record.

The best solution is to convert the tar archives to Hadoop SequenceFiles, which are splittable and therefore can be read and processed in parallel in Spark (unlike tar archives).

See: stuartsierra.com/2008/04/24/a-million-little-files
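A minimal sketch of that conversion, reusing the extractFiles and decode helpers above (the one-time-conversion setup and the "json-seqfile" path are assumptions for illustration, not part of the original answer):

    // One-time conversion: unpack each archive into (fileName, jsonText)
    // pairs and persist them as a splittable SequenceFile.
    val filePairs = sc.binaryFiles("gzarchive/*")
      .flatMapValues(x => extractFiles(x).toOption)
      .flatMapValues(_.map(decode()))

    filePairs.saveAsSequenceFile("json-seqfile")

    // Subsequent jobs can read the SequenceFile in parallel splits:
    val jsonText = sc.sequenceFile[String, String]("json-seqfile").values
    val df = sqlContext.read.json(jsonText)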


Files inside a *.tar.gz archive are, as you already mentioned, compressed. You cannot put those 3 files into a single compressed tar file and expect the import function (which looks only for text) to know how to decompress the files, unpack them from the tar archive, and then import each one individually.

I would recommend that you take the time to load each individual JSON file manually, since the sc.textFile and sqlContext.read.json functions cannot handle compressed archive data.
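If you do extract first, a single glob read then picks up every member; a quick sketch, with the extraction directory as a placeholder name:

    // After unpacking, e.g. tar -xzf archive.tar.gz -C extracted/
    // Spark reads all of the JSON files in one pass:
    val df = sqlContext.read.json("extracted/*.json")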

