I am facing a problem: I have Parquet data stored as daily chunks in S3 (in the form s3://bucketName/prefix/YYYY/MM/DD/), but I cannot read data from different dates together in AWS EMR Spark, because some column types do not match between files and I get one of many exceptions, for example:
java.lang.ClassCastException: optional binary element (UTF8) is not a group
appears when some files contain an Array type with actual values, but the same column is null in other files and is then inferred as a String type.
or
org.apache.spark.SparkException: Job aborted due to stage failure: Task 23 in stage 42.0 failed 4 times, most recent failure: Lost task 23.3 in stage 42.0 (TID 2189, ip-172-31-9-27.eu-west-1.compute.internal): org.apache.spark.SparkException: Failed to merge incompatible data types ArrayType(StructType(StructField(Id,LongType,true), StructField(Name,StringType,true), StructField(Type,StringType,true)),true)
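For reference, a minimal sketch of the kind of read that triggers this (the concrete dates are made up; the layout follows the prefix above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-daily-parquet").getOrCreate()

// Reading two daily chunks in one go; if their schemas disagree
// (e.g. array vs. string for the same column), Spark fails while
// merging them into a single schema.
val df = spark.read.parquet(
  "s3://bucketName/prefix/2016/11/01/",
  "s3://bucketName/prefix/2016/11/02/"
)
```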
I have the raw data in S3 in JSON format, and my initial plan was to create an automated job that starts an EMR cluster, reads the JSON data for the previous date and simply writes it back to S3 as Parquet.
The JSON data is also partitioned by date, i.e. the keys have date prefixes. Reading the JSON works fine: the schema is inferred from the data no matter how much of it is being read at once.
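The conversion job itself is straightforward. A minimal sketch of what I mean, with a hypothetical json-prefix for the raw data and a hardcoded date standing in for "yesterday":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

// Hypothetical date value; the real job would compute the previous date.
val date = "2016/11/01"

// The schema is inferred from whatever JSON sits under the day's prefix.
val json = spark.read.json(s"s3://bucketName/json-prefix/$date/")

// Write the same day back to S3 as Parquet.
json.write.parquet(s"s3://bucketName/prefix/$date/")
```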
But the problem arises when the Parquet files are written. As I understand it, when I write Parquet with metadata files, those files contain the schema for all the parts/partitions of the Parquet data, and it seems to me those parts can have different schemas. When I disabled writing the metadata, Spark was said to infer the whole schema from the first file under the given Parquet path and to assume it stays the same across the other files.
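If I have understood the knobs correctly, this is roughly what that looks like (parquet.enable.summary-metadata is the Hadoop-level switch for the summary files; mergeSchema asks Spark to reconcile the schemas of all part files instead of trusting the first one):

```scala
// Turn off the _metadata / _common_metadata summary files on write.
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")

// On read, merge the schemas of all part files instead of taking the
// first file's schema. Note that this still throws "Failed to merge
// incompatible data types" when types genuinely conflict, as above.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("s3://bucketName/prefix/2016/11/01/",
           "s3://bucketName/prefix/2016/11/02/")
```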
When some columns that should be of double type happen to contain only integer values for a given day, reading them from JSON (where those numbers have no decimal point) makes Spark infer the column as type long. Even if I can cast these columns to double before writing the Parquet files, this still is not good enough, because the schema can change, new columns can be added, and tracking all of that is impossible.
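The cast itself is easy enough, continuing the sketch above ("price" is a made-up column name standing in for any column inferred as long that should be double):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Force a column inferred as long back to double so the type stays
// consistent across days; the hard part is knowing which columns need it.
val fixed = json.withColumn("price", col("price").cast(DoubleType))

fixed.write.parquet(s"s3://bucketName/prefix/$date/")
```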
I have seen that other people have the same problem, but I have yet to find a good enough solution.
What are the best practices or solutions for this?