Spark fails, possibly due to binaryFiles() with more than 1 million files in HDFS

Question

I am reading millions of XML files via

val xmls = sc.binaryFiles(xmlDir)

The job runs fine locally, but on YARN it fails:

  client token: N/A
  diagnostics: Application application_1433491939773_0012 failed 2 times due to ApplicationMaster for attempt appattempt_1433491939773_0012_000002 timed out. Failing the application.
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1433750951883
  final status: FAILED
  tracking URL: http://controller01:8088/cluster/app/application_1433491939773_0012
  user: ariskk
  Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
      at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
      at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
      at org.apache.spark.deploy.yarn.Client.main(Client.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:497)
      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

In the hadoops/userlogs logs, I frequently see these messages:

  15/06/08 09:15:38 WARN util.AkkaUtils: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;@2b4f336b,BlockManagerId(1, controller01.stratified, 58510))] in 2 attempts
  java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
      at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
      at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
      at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
      at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
      at scala.concurrent.Await$.result(package.scala:107)
      at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
      at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)

I run my Spark job through spark-submit, and the same job works for another HDFS directory that contains only 37k files. Any ideas how to resolve this?

hadoop apache-spark

kostas.kougios Jun 08 '15 at 8:52

1 answer

kostas.kougios · Answer 1 · 2015-06-08T15:15:30+0000

OK, after getting some help on the Spark mailing list, I found that there were 2 problems:

1. The source directory: if it is given as /my_dir/, Spark fails and the heartbeat problems appear. Instead, it should be given as hdfs:///my_dir/*
2. After fixing #1, an out-of-memory error appears in the logs. This is the Spark driver, running on YARN, running out of memory because of the number of files (apparently it keeps all the file information in memory). So I spark-submitted the job with --conf spark.driver.memory=8g, which fixed the problem.
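The path change from the first point can be sketched as a small helper. This is only an illustration: the object HdfsGlob and the function toHdfsGlob are hypothetical names, not part of Spark, and the sketch assumes the input is an absolute HDFS directory on the cluster's default filesystem.

```scala
// Sketch only: HdfsGlob and toHdfsGlob are hypothetical names, not Spark API.
// Assumes the input is an absolute directory path on the default HDFS.
object HdfsGlob {
  /** Turns "/my_dir/" into "hdfs:///my_dir/*"; a path that already has a scheme keeps it. */
  def toHdfsGlob(dir: String): String = {
    // Drop any trailing "*" and "/" so the glob suffix is added exactly once.
    val base = dir.trim.stripSuffix("*").stripSuffix("/")
    if (base.contains("://")) s"$base/*"
    else {
      require(base.startsWith("/"), s"expected an absolute path, got: $dir")
      s"hdfs://$base/*"
    }
  }
}

// Usage inside a Spark job, where sc is the existing SparkContext:
//   val xmls = sc.binaryFiles(HdfsGlob.toHdfsGlob("/my_dir/"))
```

The helper only rewrites the path string. The driver memory from the second point is still set at submit time, with --conf spark.driver.memory=8g as described above.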
