How does Spark on Yarn store shuffled files?

I am running a filter in Spark on YARN and get the following error. Any help is appreciated, but my main question is why the file below was not found.

/hdata/10/yarn/nm/usercache/spettinato/appcache/application_1428497227446_131967/spark-local-20150708124954-aa00/05/merged_shuffle_1_343_1

It seems Spark cannot find a file that was saved during the shuffle, which I assumed would be in HDFS.

Why is Spark accessing the /hdata/ directory? This directory does not exist in HDFS. Should it be a local directory or an HDFS directory?
Can I customize the location where shuffled data is stored?

15/07/08 12:57:03 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /hdata/10/yarn/nm/usercache/spettinato/appcache/application_1428497227446_131967/spark-local-20150708124954-aa00/05/merged_shuffle_1_343_1 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
    at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

EDIT: I have figured out part of this. According to http://spark.apache.org/docs/latest/configuration.html, the directory configured by spark.local.dir is the local directory Spark uses for scratch space, including map output files and RDDs that get stored on disk.
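
To illustrate (this is only a sketch with made-up paths, not taken from my cluster): in local or standalone mode you can point spark.local.dir at disks with enough space, but on YARN the NodeManager's yarn.nodemanager.local-dirs (exposed to containers as LOCAL_DIRS) takes precedence, which would explain why the shuffle files end up under the usercache/appcache layout shown above.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only, paths are placeholders. spark.local.dir controls the scratch
    // space for shuffle output and spilled RDD blocks when Spark chooses the
    // directories itself; on YARN it is overridden by the NodeManager's
    // yarn.nodemanager.local-dirs (LOCAL_DIRS).
    val conf = new SparkConf()
      .setAppName("shuffle-scratch-example")
      .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
    val sc = new SparkContext(conf)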

+7
apache-spark
2 answers

Most likely the task died, for example from an OutOfMemoryError or another exception.
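
If it is memory, one common first step is to give the executors more headroom. A rough sketch, with placeholder values rather than a recommendation:

    // Placeholder values: raise executor memory (and on YARN the off-heap
    // overhead) if tasks are dying with OutOfMemoryError.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.yarn.executor.memoryOverhead", "768")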

+2

I suggest checking the remaining space on your system. Like Carlos, I would say the task died, and the reason is that Spark could not write the shuffle file because it ran out of space.

Try grepping for java.io.IOException: No space left on device in the ./work directory of your workers.
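
If you prefer to check free space programmatically instead of with df, something along these lines works; the directory list is hypothetical, substitute whatever your NodeManager local dirs actually are:

    import java.io.File

    // Hypothetical directory roots; use the yarn.nodemanager.local-dirs values
    // from your nodes (the failing path suggests something like /hdata/NN).
    val localDirs = Seq("/hdata/10", "/hdata/11")
    localDirs.foreach { dir =>
      val gbFree = new File(dir).getUsableSpace.toDouble / (1024L * 1024 * 1024)
      println(f"$dir: $gbFree%.1f GB free")
    }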

+4
