How to save a Spark RDD in gzip format via PySpark

I save a Spark RDD to an S3 bucket using the following code. Is there a way to compress the output (in gz format) instead of saving it as plain text?

help_data.repartition(5).saveAsTextFile("s3://help-test/logs/help")
1 answer

The saveAsTextFile method takes an optional second argument that specifies the compression codec class:

help_data.repartition(5).saveAsTextFile(
    path="s3://help-test/logs/help",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
