I am running a filter in Spark on YARN and get the error below. Any help is appreciated, but my main question is why the file was not found.
/hdata/10/yarn/nm/usercache/spettinato/appcache/application_1428497227446_131967/spark-local-20150708124954-aa00/05/merged_shuffle_1_343_1
It seems Spark cannot find a file that was saved as part of the shuffle.
Why is Spark accessing the /hdata/ directory? This directory does not exist in HDFS. Should it be a local directory or an HDFS directory?
Can I customize the location where shuffled data is stored?
15/07/08 12:57:03 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /hdata/10/yarn/nm/usercache/spettinato/appcache/application_1428497227446_131967/spark-local-20150708124954-aa00/05/merged_shuffle_1_343_1 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
    at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
EDIT: I have figured out part of this. According to http://spark.apache.org/docs/latest/configuration.html, the directory configured by spark.local.dir is the local scratch space Spark uses, including for map output files and RDDs that get stored on disk, so it is a local directory rather than an HDFS one.
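For reference, here is a minimal sketch of how spark.local.dir can be set programmatically; the application name, directory paths, and the toy job are placeholders of my own, not anything from the failing application:

import org.apache.spark.{SparkConf, SparkContext}

object LocalDirDemo {
  def main(args: Array[String]): Unit = {
    // spark.local.dir points Spark's scratch space (shuffle output,
    // disk-persisted RDD blocks) at one or more local directories.
    // Comma-separated paths spread shuffle I/O across multiple disks.
    val conf = new SparkConf()
      .setAppName("local-dir-demo")
      .set("spark.local.dir", "/tmp/spark-scratch1,/tmp/spark-scratch2") // placeholder paths

    val sc = new SparkContext(conf)
    try {
      // A toy filter plus shuffle that exercises the scratch directories.
      sc.parallelize(1 to 1000)
        .filter(_ % 2 == 0)
        .map(n => (n % 10, n))
        .groupByKey()
        .count()
    } finally {
      sc.stop()
    }
  }
}

Note that the same configuration page says that on YARN this setting is overridden by the LOCAL_DIRS environment variable set by the cluster manager, which would explain why the trace shows a NodeManager-managed path under /hdata/10/yarn/nm/... rather than anything I configured myself.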