How to control the number of parquet files generated when using partitionBy

I have a DataFrame that I need to write to S3 according to a specific partition. The code is as follows:

    dataframe.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "date", "country", "predicate")
      .parquet(outputPath)

partitionBy splits the data into a fairly large number of folders (~400) with a small amount of data (~1 GB) in each. The problem is that, since the default value of spark.sql.shuffle.partitions is 200, the ~1 GB of data in each folder is split across 200 small parquet files, so roughly 80,000 parquet files get written in total. This is not optimal for a number of reasons, and I would like to avoid it.

I could, of course, set spark.sql.shuffle.partitions to a much smaller number, say 10, but as I understand it this parameter also controls the number of shuffle partitions used for joins and aggregations, so I would rather not change it.
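For illustration, that change would just be a session setting before the write (a minimal sketch, assuming spark is my active SparkSession), but since the config is global it would shrink every shuffle in the job, not only this write:

    // Affects ALL shuffles (joins, aggregations), not only this write,
    // which is exactly why I don't want to change it globally.
    spark.conf.set("spark.sql.shuffle.partitions", "10")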

Does anyone know of another way to control how many files get written?

1 answer

As you correctly noted, spark.sql.shuffle.partitions only applies to shuffles and joins in Spark SQL.

partitionBy on a DataFrameWriter (you go from a DataFrame to a DataFrameWriter as soon as you call write) simply works with the existing number of partitions. (The writer's partitionBy only specifies the columns by which the table / parquet output is laid out into folders, so it has nothing to do with the number of partitions. This is a bit confusing.)

In short, just repartition the DataFrame before handing it to the writer.
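A minimal sketch of that, using the columns from the question: repartitioning by the same columns passed to partitionBy hash-distributes the rows so that everything belonging to one output folder lands in a single in-memory partition, which yields roughly one parquet file per folder.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Repartition by the same columns used in partitionBy: all rows for a
    // given (year, month, date, country, predicate) combination end up in
    // the same partition, so each output folder gets a single file.
    dataframe
      .repartition(col("year"), col("month"), col("date"), col("country"), col("predicate"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "date", "country", "predicate")
      .parquet(outputPath)

Alternatively, a plain repartition(n) or coalesce(n) before write caps the total number of partitions feeding the writer, so each folder gets at most n files instead of 200.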
