I am trying to create a Spark RDD from several JSON files compressed into a tar archive. For example, I have 3 files
file1.json, file2.json, file3.json
and they are contained in archive.tar.gz.
I want to create a DataFrame from the JSON files. The problem is that Spark does not read the JSON files correctly: both sqlContext.read.json("archive.tar.gz") and sc.textFile("archive.tar.gz") produce garbled output with extra characters, since Spark only strips the gzip layer and then treats the raw tar byte stream, tar headers included, as text.
Is there a way to handle gzipped archives containing multiple files in Spark?
UPDATE
Using the method suggested in the answer to Read entire text files from compression in Spark, I managed to get everything running, but this method does not seem suitable for large tar.gz archives (> 200 MB compressed), because the application chokes on large archives. Since some of the archives I deal with are up to 2 GB compressed, I wonder whether there is an efficient way to handle this.
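For concreteness, here is roughly what I am running now, a minimal sketch based on that answer. It assumes Apache Commons Compress is on the classpath and that each .json file inside the archive holds line-delimited JSON (the format Spark's JSON reader expects); the path is a placeholder:

    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
    import org.apache.commons.compress.utils.IOUtils

    // Each tar.gz becomes a single binary record: gzip is not a splittable
    // codec, so one task has to process one whole archive.
    val jsonLines = sc.binaryFiles("archive.tar.gz").flatMap { case (_, pds) =>
      val tar = new TarArchiveInputStream(new GzipCompressorInputStream(pds.open()))
      val docs = Iterator
        .continually(tar.getNextTarEntry)
        .takeWhile(_ != null)
        .filter(e => !e.isDirectory && e.getName.endsWith(".json"))
        // IOUtils.toByteArray stops at the current entry's boundary
        .flatMap(_ => new String(IOUtils.toByteArray(tar), "UTF-8").split("\n"))
        .toList // buffers the contents of every entry before the stream closes
      tar.close()
      docs
    }

    val df = sqlContext.read.json(jsonLines)

The .toList is what hurts here: all decompressed entries of an archive are held in memory at once, which is presumably why large archives blow up.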
I want to avoid extracting the archives and then merging the files, as this would be time consuming. One idea to ease the memory pressure is sketched below.
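Under the same assumptions as above, the sketch drops the .toList and hands Spark a lazy iterator instead: the iterator is consumed inside the same task while the tar stream is still open, so only the entry currently being decoded is held in memory.

    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
    import org.apache.commons.compress.utils.IOUtils

    // Lazy variant: entries are decoded one at a time as Spark pulls
    // records; the whole archive is never materialized at once.
    val jsonLines = sc.binaryFiles("archives/*.tar.gz").flatMap { case (_, pds) =>
      val tar = new TarArchiveInputStream(new GzipCompressorInputStream(pds.open()))
      Iterator
        .continually(tar.getNextTarEntry)
        .takeWhile(_ != null)
        .filter(e => !e.isDirectory && e.getName.endsWith(".json"))
        .flatMap(_ => new String(IOUtils.toByteArray(tar), "UTF-8").split("\n"))
      // no explicit close: the stream lives until the task finishes
    }

    val df = sqlContext.read.json(jsonLines)

Each individual .json file is still read fully by IOUtils.toByteArray, but a single file should be small compared to the archive. A gzipped tar still cannot be split across tasks, so parallelism only comes from having many archives matched by the glob; a single 2 GB archive will always be handled by one task.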
scala gzip apache-spark rdd
septra