Add new column to existing parquet file

Can I add a new column to an existing parquet file?

I am currently working on a Kaggle competition and have converted all the data into parquet files.

In my case, I read the parquet file into a PySpark DataFrame, wrote some extraction functions, and added new columns to the DataFrame with

pyspark.DataFrame.withColumn().

After that, I want to save the new columns back into the original parquet file.

I know that Spark SQL supports Parquet schema evolution (schema merging), but the documentation example only shows merging with a key column.

Parquet append mode does not help either: it only appends new rows to the file. Is there any way to add a new column to an existing parquet file without regenerating the entire table? Or do I need to write the new columns to a separate parquet file and join the two at runtime?

+4
2 answers

Although this question has been up for two years without receiving an answer, allow me to answer my own question.

When I was still working with Spark, the Spark version was 1.4. I cannot speak for newer versions, but in that version it was impossible to add a new column to an existing parquet file in place.

+1

With Parquet you do not modify files in place: you read them, modify the data, and write them back. You cannot change just one column; you have to read and rewrite the full file.

0
