Slow upload of Apache Spark results to S3

I see a serious performance issue when Apache Spark uploads its results to S3. As I understand it, the upload happens in these steps ...

  • The output of the final stage is written to _temp/ in HDFS, and its contents are then moved into a "_temporary" folder inside the specific output folder on S3.

  • Once the whole job is complete, Spark finishes the saveAsTextFile step, and the files in the "_temporary" folder on S3 are moved into the main output folder. This move is very slow: about 1 minute per file (average size: 600 MB BZ2). This part is not logged in the regular stderr log.
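The two-step flow above can be sketched on a local filesystem. This is only an illustration of the write-then-commit pattern, not Spark's actual committer code; the directory names just mirror the Hadoop FileOutputCommitter layout. The key point is in the second phase: a rename is cheap on HDFS, but on S3 each "move" is a full copy plus a delete, which is why this step dominates the job's wall-clock time.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-filesystem sketch of the two-phase commit: tasks write part
// files under "_temporary", then a commit step moves each part file
// into the job's output folder. Names are illustrative only.
public class TwoPhaseCommitSketch {

    // Writes n part files into output/_temporary, commits them into
    // output, and returns the number of committed part files.
    static int runJob(Path output, int n) throws IOException {
        Path temporary = Files.createDirectories(output.resolve("_temporary"));

        // Phase 1: each task writes its own part file under _temporary.
        for (int i = 0; i < n; i++) {
            Path part = temporary.resolve(String.format("part-%05d", i));
            Files.write(part, ("data from task " + i + "\n").getBytes());
        }

        // Phase 2: the commit moves every part file into the output
        // folder. This rename is cheap on HDFS or a local disk, but on
        // S3 each move is a full copy followed by a delete.
        int committed = 0;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(temporary)) {
            for (Path part : parts) {
                Files.move(part, output.resolve(part.getFileName()));
                committed++;
            }
        }
        Files.delete(temporary);
        return committed;
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempDirectory("job-output");
        System.out.println(runJob(output, 3) + " part files committed");
    }
}
```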

I am using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.

Has anyone encountered this issue?

Update 1

How can I increase the number of threads performing this move?
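As far as I know there is no setting in Spark 1.0.1 for this; the committer's move step runs single-threaded. One hedged workaround is to save the output to HDFS first and then perform the transfer to S3 yourself with a thread pool. The sketch below uses the local filesystem via java.nio to show the parallel-move structure; against S3 you would replace Files.move with copy and delete requests through the AWS SDK. The class and method names are illustrative, not part of any Spark API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: move every file from a source folder to a target
// folder on a fixed-size thread pool, so several moves run at once.
public class ParallelMove {

    static int moveAll(Path source, Path target, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Path>> pending = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(source)) {
            for (Path file : files) {
                // Each move runs on its own pool thread; against S3 this
                // would be one copy-plus-delete request per task.
                pending.add(pool.submit(
                        () -> Files.move(file, target.resolve(file.getFileName()))));
            }
        }
        for (Future<Path> f : pending) {
            f.get(); // propagate any IOException thrown by a worker
        }
        pool.shutdown();
        return pending.size();
    }

    public static void main(String[] args) throws Exception {
        Path source = Files.createTempDirectory("staging");
        Path target = Files.createTempDirectory("final");
        for (int i = 0; i < 4; i++) {
            Files.write(source.resolve("part-" + i), "x".getBytes());
        }
        System.out.println(moveAll(source, target, 4) + " files moved");
    }
}
```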

Any suggestion is much appreciated ...

Thanks

2 answers

This was fixed by SPARK-3595 ( https://issues.apache.org/jira/browse/SPARK-3595 ), which is included in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark ).


I am using the code below. It uploads the output to S3, transferring about 60 GB of gz files in 4-6 minutes.

        ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
        counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
                TextOutputFormat.class);

Make sure you create more output files; a larger number of smaller files will make the transfer faster.
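To act on that advice, you can repartition the RDD before saving so each part file lands near a target size (repartition is a real RDD method; everything else below is a hypothetical helper, and the 128 MB target is just an example). For instance, counts.repartition(n) before saveAsHadoopFile, with n computed like this:

```java
// Hypothetical helper: choose a partition count so each output part
// file lands near a target size, e.g. for use as
// rdd.repartition(n).saveAsHadoopFile(...).
public class PartitionSizing {

    // Ceiling division of totalBytes by targetBytesPerFile, minimum 1.
    static int targetPartitions(long totalBytes, long targetBytesPerFile) {
        return (int) Math.max(1, (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile);
    }

    public static void main(String[] args) {
        // 60 GB of output at ~128 MB per file -> 480 part files instead
        // of ~100 files of 600 MB, so the final moves to S3 overlap better.
        System.out.println(targetPartitions(60L << 30, 128L << 20)); // prints 480
    }
}
```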

API: saveAsHadoopFile(path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: org.apache.hadoop.mapred.OutputFormat[_, _]], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]) — outputs the RDD to any Hadoop-supported file system.
