I have run into a number of problems trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions about those problems, but this is what I really need to do. The code should look something like this:
    import org.apache.spark._

    val sqlContext = new sql.SQLContext(sc)

    // second argument to jsonFile is the sampling ratio used for schema inference
    val data = sqlContext.jsonFile("s3n://...", 10e-6)
    data.saveAsParquetFile("s3n://...")
I run into problems if I have either more than 2000 partitions or if any single partition is larger than 5G. This puts an upper bound on the maximum size of SchemaRDD I can process this way. The practical limit is closer to 1T, since partition sizes vary widely and it only takes one 5G partition for the process to fail.
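For concreteness, the repartitioned write I am referring to looks roughly like the sketch below. numShards is just an illustrative value picked by hand so that each output shard should stay well under 5G; I do not have a principled way of choosing it.

    // Hypothetical shard count, picked so no single output shard should exceed 5G.
    val numShards = 2000

    // Repartition before writing. If repartition does not preserve the SchemaRDD
    // type in this Spark version, reapplying the inferred schema gets one back
    // so the result can still be saved as Parquet.
    val reshaped = sqlContext.applySchema(data.repartition(numShards), data.schema)
    reshaped.saveAsParquetFile("s3n://...")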
The questions dealing with the specific problems I am facing are:
- Multipart uploads to Amazon S3 from Apache Spark
- Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
- Spark SQL unable to complete writing Parquet data with a large number of shards
This question is about whether there are any solutions to the main goal that do not necessarily involve solving one of those problems directly.
To distill the problem, there are two issues:
Writing a single shard larger than 5G to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible with s3:// buckets, but that does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either (see the configuration sketch after the second issue below).
Writing the summary file tends to fail if there are 1000s of shards. There seem to be a couple of issues here. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error, even on an r3.8xlarge (244G RAM), once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file appears to be essential for efficient querying.
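In case it helps frame an answer, the kind of configuration workaround I imagine for the two issues above is sketched here. Both property names are assumptions on my part (the s3n multipart setting comes from Hadoop 2.4+, the summary flag from parquet-mr), so they may not exist in the Hadoop/Parquet versions bundled with Spark 1.1.0, and skipping the summary file gives up the efficient querying it enables.

    // Assumed property names; may not apply to every Hadoop/Parquet build.
    val hadoopConf = sc.hadoopConfiguration

    // Issue 1: let s3n:// write objects larger than 5G via multipart uploads
    // (a Hadoop 2.4+ setting, off by default).
    hadoopConf.setBoolean("fs.s3n.multipart.uploads.enabled", true)

    // Issue 2: skip writing the _metadata summary file to avoid the driver OOM,
    // at the cost of the query efficiency the summary file provides.
    hadoopConf.setBoolean("parquet.enable.summary-metadata", false)

    data.saveAsParquetFile("s3n://...")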
Taken together, these problems limit Parquet tables on S3 to 25T (roughly 5000 shards at 5G each). In practice the limit is actually significantly lower, since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.
How can I write a >25T RDD as Parquet to S3?
I am using Spark 1.1.0.
amazon-s3 apache-spark apache-spark-sql parquet
Daniel Mahler