How to control the number of parquet files generated when using partitionBy

I have a DataFrame that I need to write to S3 according to a specific partition. The code is as follows:

    dataframe.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "date", "country", "predicate")
      .parquet(outputPath)

partitionBy splits the data into a fairly large number of folders (~400) with a small amount of data (~1 GB) in each. The problem is that, since the default value of spark.sql.shuffle.partitions is 200, the ~1 GB of data in each folder is split across 200 small parquet files, so roughly 80,000 parquet files get written in total. This is not optimal for a number of reasons, and I would like to avoid it.

I could, of course, set spark.sql.shuffle.partitions to a much smaller number, say 10, but as I understand it this parameter also controls the number of shuffle partitions used for joins and aggregations, so I would rather not change it.
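For illustration, that change would just be a session setting before the write (a minimal sketch, assuming spark is my active SparkSession), but since the config is global it would shrink every shuffle in the job, not only this write:

    // Affects ALL shuffles (joins, aggregations), not only this write,
    // which is exactly why I don't want to change it globally.
    spark.conf.set("spark.sql.shuffle.partitions", "10")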

Does anyone know of another way to control how many files get written?

1 answer

As you correctly noted, spark.sql.shuffle.partitions only applies to shuffles and joins in Spark SQL.

partitionBy on a DataFrameWriter (you go from a DataFrame to a DataFrameWriter as soon as you call write) simply works with the existing number of partitions. (The writer's partitionBy only specifies the columns by which the table / parquet output is laid out into folders, so it has nothing to do with the number of partitions. This is a bit confusing.)

In short, just repartition the DataFrame before handing it to the writer.
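A minimal sketch of that, using the columns from the question: repartitioning by the same columns passed to partitionBy hash-distributes the rows so that everything belonging to one output folder lands in a single in-memory partition, which yields roughly one parquet file per folder.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Repartition by the same columns used in partitionBy: all rows for a
    // given (year, month, date, country, predicate) combination end up in
    // the same partition, so each output folder gets a single file.
    dataframe
      .repartition(col("year"), col("month"), col("date"), col("country"), col("predicate"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "date", "country", "predicate")
      .parquet(outputPath)

Alternatively, a plain repartition(n) or coalesce(n) before write caps the total number of partitions feeding the writer, so each folder gets at most n files instead of 200.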
