Can I add a new column to an existing parquet file?
I am currently working on a Kaggle contest and have converted all the data into parquet files.
I read a parquet file into a PySpark DataFrame, wrote some extraction functions, and added new columns to the DataFrame with
pyspark.DataFrame.withColumn().
Now I want to save those new columns back into the original parquet file.
I know that Spark SQL supports parquet schema merging, but the documented example only shows merging the schemas of separate files.
Parquet's append mode does not help either: it only appends new rows to the file. Is there any way to add a new column to an existing parquet file without regenerating the entire table? Or do I need to write the new columns to a separate parquet file and join the two at runtime?