I have run into a number of problems trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions about those problems, but this is what I really need to do. The code should look something like this:
    import org.apache.spark._

    val sqlContext = new sql.SQLContext(sc)

    // second argument to jsonFile is the sampling ratio used for schema inference
    val data = sqlContext.jsonFile("s3n://...", 10e-6)
    data.saveAsParquetFile("s3n://...")
I run into problems if I have either more than 2000 partitions or if any single partition is larger than 5G. This puts an upper bound on the maximum size of SchemaRDD I can process this way. The practical limit is closer to 1T, since partition sizes vary widely and it only takes one 5G partition for the process to fail.
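For concreteness, the repartitioned write I am referring to looks roughly like the sketch below. numShards is just an illustrative value picked by hand so that each output shard should stay well under 5G; I do not have a principled way of choosing it.

    // Hypothetical shard count, picked so no single output shard should exceed 5G.
    val numShards = 2000

    // Repartition before writing. If repartition does not preserve the SchemaRDD
    // type in this Spark version, reapplying the inferred schema gets one back
    // so the result can still be saved as Parquet.
    val reshaped = sqlContext.applySchema(data.repartition(numShards), data.schema)
    reshaped.saveAsParquetFile("s3n://...")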
The questions dealing with the specific problems I am facing are:
- Multipart uploads to Amazon S3 from Apache Spark
- Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
- Spark SQL unable to complete writing Parquet data with a large number of shards
This question is about whether there are any solutions to the main goal that do not necessarily involve solving one of those problems directly.
To distill the problem, there are two issues:
Writing a single shard larger than 5G to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible with s3:// buckets, but that does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either (see the configuration sketch after the second issue below).
Writing the summary file tends to fail if there are 1000s of shards. There seem to be a couple of issues here. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error, even on an r3.8xlarge (244G RAM), once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file appears to be essential for efficient querying.
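In case it helps frame an answer, the kind of configuration workaround I imagine for the two issues above is sketched here. Both property names are assumptions on my part (the s3n multipart setting comes from Hadoop 2.4+, the summary flag from parquet-mr), so they may not exist in the Hadoop/Parquet versions bundled with Spark 1.1.0, and skipping the summary file gives up the efficient querying it enables.

    // Assumed property names; may not apply to every Hadoop/Parquet build.
    val hadoopConf = sc.hadoopConfiguration

    // Issue 1: let s3n:// write objects larger than 5G via multipart uploads
    // (a Hadoop 2.4+ setting, off by default).
    hadoopConf.setBoolean("fs.s3n.multipart.uploads.enabled", true)

    // Issue 2: skip writing the _metadata summary file to avoid the driver OOM,
    // at the cost of the query efficiency the summary file provides.
    hadoopConf.setBoolean("parquet.enable.summary-metadata", false)

    data.saveAsParquetFile("s3n://...")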
Taken together, these problems limit Parquet tables on S3 to 25T (roughly 5000 shards at 5G each). In practice the limit is actually significantly lower, since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.
How can I write a >25T RDD as Parquet to S3?
I am using Spark 1.1.0.
amazon-s3 apache-spark apache-spark-sql parquet
Daniel Mahler