I have a DataFrame that I need to write to S3, partitioned by specific columns. The code is as follows:
dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)
partitionBy splits the data into a fairly large number of folders (~400), each holding a small amount of data (~1 GB). This is where the problem arises: since the default value of spark.sql.shuffle.partitions is 200, the 1 GB of data in each folder is split into 200 small parquet files, so roughly 80,000 parquet files get written in total. This is suboptimal for a number of reasons, and I would like to avoid it.
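To make the arithmetic explicit (rough numbers, assuming every shuffle partition ends up holding rows for every output folder):

val shufflePartitions = 200  // default spark.sql.shuffle.partitions
val outputFolders     = 400  // distinct (year, month, date, country, predicate) combinations
val filesWritten      = shufflePartitions * outputFolders  // roughly 80,000 parquet files, ~5 MB each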
I can, of course, set spark.sql.shuffle.partitions to a much smaller number, say 10, but as I understand it this parameter also controls the number of shuffle partitions used for joins and aggregations, so I would rather not change it. The sketch below shows what I mean.
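For concreteness, this is the workaround I'm reluctant to use, as a minimal sketch assuming a SparkSession named spark:

// Lowers the shuffle-partition count for the whole session,
// so it affects joins and aggregations too, not just this write.
spark.conf.set("spark.sql.shuffle.partitions", "10")

dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)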
Does anyone know of another way to control how many files are written?