Spark: Long Delay Between Jobs

We run a Spark job that extracts data, performs extensive data transformation, and writes the results out to several different files. Everything works fine, but I am getting random, huge delays between one resource-intensive job finishing and the next job starting.

In the figure below, the job submitted at 17:22:02 took 15 minutes, so I would expect the next job to be submitted around 17:37:02. Instead, the next job was submitted at 22:05:59, more than 4 hours after the previous one completed successfully.

When I drill down into the Spark UI detail for that following job, it shows a scheduler delay of <1 second, so I don't understand where this 4-hour delay is coming from.

[Screenshot: Spark UI job detail view]

(Spark 1.6.1 with Hadoop 2)

Update:

I can confirm David's answer below that the way Spark handles I/O is a bit unexpected. (It makes sense that writing out files essentially does a "collect" behind the scenes before the write, given ordering and/or other operations.) But I am a little thrown off by the fact that the I/O time is not included in the job's execution time. I suppose you can see it in the "SQL" tab of the Spark UI, since the queries are still running even though all the jobs have completed successfully, but you can't drill into them at all.

I'm sure there are other ways to improve this, but the following two were enough for me (see the sketch after this list):

  1. reduce the number of output files
  2. set parquet.enable.summary-metadata to false
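
As an illustration only, here is a minimal PySpark sketch of those two changes, written against the Spark 1.6-era DataFrame API; the app name, paths, and partition count are hypothetical, not taken from the question:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="fewer-output-files")       # hypothetical app name
    sqlContext = SQLContext(sc)

    # Fix 2: stop Parquet from writing the _metadata / _common_metadata summary
    # files, which otherwise get produced and scanned on the driver after the job.
    sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

    df = sqlContext.read.parquet("s3://my-bucket/input/")  # hypothetical input path

    # ... extensive transformations here ...

    # Fix 1: coalesce before writing so far fewer part-files have to be
    # committed/renamed afterwards. 32 is an arbitrary example value.
    df.coalesce(32).write.parquet("s3://my-bucket/output/")  # hypothetical output path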
2 answers

Expensive I/O operations often come with significant overhead that happens on the master node. Since this work is not parallelized, it can take quite a while. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O work done by the master node:

  • Spark writes to temporary S3 directories, then the master node moves the files into place
  • Reading text files often happens on the master node
  • When writing Parquet files, the master node scans all the files after the write to check the schema

These issues can be solved by tweaking YARN settings or by restructuring your code. If you provide some source code, I might be able to pinpoint your specific problem.
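
As an illustration (my own sketch, not part of the answer above), one code-side tweak from that era is version 2 of Hadoop's FileOutputCommitter algorithm, which lets each task move its output into the final directory itself instead of leaving all the renames to the driver at job commit; it requires Hadoop 2.7+, and the property name below is the standard Hadoop one:

    from pyspark import SparkContext

    sc = SparkContext(appName="less-driver-side-commit-work")  # hypothetical app name

    # With algorithm version 2, task commit renames output files directly into
    # the final output directory, so the driver does far less renaming at job commit.
    sc._jsc.hadoopConfiguration().set(
        "mapreduce.fileoutputcommitter.algorithm.version", "2")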

Discussion of write I/O overhead with Parquet and S3

Discussion of read I/O overhead: "S3 is not a filesystem"


Problem:

I ran into a similar issue when writing Parquet data to S3 with PySpark on EMR 5.5.1. All the executors would finish writing data to the _temporary folder under the output path, and the Spark UI would show that all the tasks had completed. But the Hadoop Resource Manager UI would not release the resources for the application or mark it as complete. On checking the S3 bucket, it turned out the Spark driver was moving the files one by one from _temporary to the output bucket, which was extremely slow, and the whole cluster sat idle except for the driver node.

Solution:

The solution that worked for me was to use AWS's optimized committer class (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.

e.g.:

spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true

or

pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true

Please note that this property is available in EMR 5.19 or higher.
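
If you would rather set it in code than on the command line, a minimal sketch (assuming EMR 5.19+ with Spark 2.x; the app name is hypothetical) looks like this:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("emrfs-optimized-committer")  # hypothetical app name
             .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled",
                     "true")
             .getOrCreate())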

Result:

After running the Spark job on EMR 5.20.0 with the above setting, it did not create any _temporary directory at all, and all the files were written directly to the output bucket, so the job completed very quickly.

For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
