Add new column to existing parquet file

Can I add a new column to an existing parquet file?

I am currently working on a Kaggle competition and have converted all the data into parquet files.

In my case, I read the parquet file into a PySpark DataFrame, wrote some extraction functions, and added new columns to the DataFrame with

pyspark.DataFrame.withColumn().

After that, I want to save the new columns back into the original parquet file.

I know that Spark SQL supports Parquet schema evolution (schema merging), but the documentation example only shows merging with a key column.

Parquet append mode does not help either: it only appends new rows to the file. Is there any way to add a new column to an existing parquet file without regenerating the entire table? Or do I need to write the new columns to a separate parquet file and join the two at runtime?

+4
2 answers

Although this question has been up for two years without receiving an answer, allow me to answer my own question.

When I was still working with Spark, the Spark version was 1.4. I cannot speak for newer versions, but in that version it was impossible to add a new column to an existing parquet file in place.

+1

With Parquet you do not modify files in place: you read them, modify the data, and write them back. You cannot change just one column; you have to read and rewrite the full file.

0
