“Shuffle Read Blocked Time” is the time that tasks spent blocked waiting for shuffle data to be read from remote machines. The exact metric it comes from is shuffleReadMetrics.fetchWaitTime.
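If you want to watch this metric programmatically rather than in the web UI, here is a minimal sketch using Spark's Scala API and a SparkListener. It assumes an existing SparkContext named `sc`; the `FetchWaitLogger` name is just illustrative:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs each finished task's fetch wait time, i.e. the metric that
// the web UI reports as "Shuffle Read Blocked Time".
class FetchWaitLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) { // taskMetrics can be null for failed tasks
      val waitMs = metrics.shuffleReadMetrics.fetchWaitTime
      if (waitMs > 0) {
        println(s"Task ${taskEnd.taskInfo.taskId} in stage ${taskEnd.stageId} " +
          s"was blocked for ${waitMs} ms waiting on remote shuffle reads")
      }
    }
  }
}

// Register the listener on the SparkContext (here assumed to be `sc`):
// sc.addSparkListener(new FetchWaitLogger)
```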
It is hard to suggest a mitigation strategy without knowing what data you are trying to read or which remote machines you are reading from. However, consider the following:
- Check the connectivity to the remote machines from which you are reading the data.
- Check your code/jobs to make sure you are only reading the data you actually need to finish the job (see the sketch after this list).
- In some cases, consider splitting the work into several jobs that run in parallel, provided they are independent of each other.
- You could also scale your cluster out to more nodes, so the workload is split more granularly and the overall waiting time is shorter.
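To illustrate the second point, here is a hedged Scala sketch; the dataset path and the column names (`userId`, `bytes`) are made up for the example. The idea is to prune columns and filter rows before the wide (shuffle-inducing) operation, so less shuffle data has to be fetched from remote machines in the first place:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("prune-before-shuffle").getOrCreate()
import spark.implicits._

// Illustrative input path and schema.
val events = spark.read.parquet("/data/events")

// Select only the needed columns and filter early, *before* the groupBy,
// so the shuffle moves only the data the job actually requires.
val perUser = events
  .select($"userId", $"bytes")
  .filter($"bytes" > 0)
  .groupBy($"userId")
  .sum("bytes")
```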
Regarding metrics, this documentation should shed light on them: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-webui-StagePage.html
Finally, it was also difficult for me to find information about Shuffle Read Blocked Time, but if you search Google with the phrase in quotation marks, i.e. “Shuffle Read Blocked Time”, you will find good results.