So, we run a Spark job that extracts data, performs extensive transformations, and writes the results out to several different files. Everything works fine, but I get random, huge delays between the end of the resource-intensive work and the start of the next job.
In the figure below, you can see that the job scheduled at 17:22:02 took 15 minutes, so I would expect the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, i.e. more than 4 hours after the previous one finished successfully.
When I drill down into the Spark UI for that job, it shows a scheduler delay of <1 second, so I don't understand where this 4-hour delay comes from.

(Spark 1.6.1 with Hadoop 2)
Update:
I can confirm David's answer below that the way Spark handles I/O is a bit unexpected. (It makes sense that the write essentially "collects" behind the scenes before writing, given ordering and/or other operations.) But I am a little confused that this I/O time is not counted as part of the task execution time. I suppose you can see it on the "SQL" tab of the Spark UI, since the queries are still running even though all the tasks have completed successfully, but you can't drill into them at all.
I'm sure there are other ways to improve this, but these two were enough for me (see the sketch after the list):
- reduce the number of files
- set parquet.enable.summary-metadata to false
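A minimal sketch of how both changes might look in Spark 1.6 (Scala). The input/output paths, the DataFrame, and the partition count are hypothetical placeholders; parquet.enable.summary-metadata is set on the Hadoop configuration, and coalesce simply reduces the number of output files:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetWriteExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-write-example"))
    val sqlContext = new SQLContext(sc)

    // Skip writing the _metadata/_common_metadata summary files, which are
    // built on the driver after all write tasks have already finished.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Hypothetical input path and DataFrame.
    val df = sqlContext.read.parquet("/path/to/input")

    // Fewer partitions -> fewer output files -> less post-task work.
    // 32 is an arbitrary example value; tune it to your data size.
    df.coalesce(32)
      .write
      .parquet("/path/to/output")

    sc.stop()
  }
}
```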