Why does Spark standalone report "ExecutorLostFailure (executor driver lost)" with cogroup?

I use cogroup on two datasets (one 9 GB, the other 110 KB), running Spark in standalone mode, as shown in the code below.

I have 128 GB RAM and 24 cores. My settings:

 set("spark.executor.memory","64g") set("spark.driver.memory","64g") 

IntelliJ VM Options: -Xmx128G
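
For completeness, the context is set up roughly like this (a simplified sketch; the app name and the local master are placeholders rather than my exact code):

    import org.apache.spark.{SparkConf, SparkContext}

    // Simplified sketch of the setup; app name and master are placeholders.
    val conf = new SparkConf()
      .setAppName("cogroup-test")            // placeholder name
      .setMaster("local[24]")                // 24 cores on this machine
      .set("spark.executor.memory", "64g")
      .set("spark.driver.memory", "64g")
    val sc = new SparkContext(conf)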

As you can see from the code, I split the data into 1000 partitions. I also tried 5000 and 10000, since countByKey is very expensive in my case.

From some other StackOverflow posts I saw the spark.default.parallelism parameter. How should I configure these settings? Do I need to add anything else to the IntelliJ VM options? Should I use spark.default.parallelism?
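
For example, is something along these lines what is meant (just a sketch; the value 240 is arbitrary, roughly 10x the core count)?

    // Hypothetical example only: raise the default shuffle parallelism
    // so shuffles like cogroup use more reduce tasks when no explicit
    // partition count is given.
    val conf = new SparkConf()
      .set("spark.default.parallelism", "240")  // placeholder value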

    // read both inputs with 1000 partitions and key them on column 3
    val emp = sc.textFile("\\text1.txt", 1000)
      .map { line => val s = line.split("\t"); (s(3), s(1)) }
    val emp_new = sc.textFile("\\text2.txt", 1000)
      .map { line => val s = line.split("\t"); (s(3), s(1)) }

    // group both datasets by the shared key
    val cog = emp.cogroup(emp_new)

    // emit every (value1, value2) pair per key, then count the pairs
    val skk = cog.flatMap {
      case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
        for { e1 <- l1.toSeq; e2 <- l2.toSeq } yield ((e1, e2), 1)
    }
    val com = skk.countByKey()
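
As far as I understand, countByKey collects the per-key counts back to the driver as a Map; if that matters here, a distributed alternative I could try would look roughly like this (a sketch, not what I currently run):

    // Sketch only: keep the counting on the executors instead of
    // collecting a Map of counts to the driver with countByKey.
    val com = skk.reduceByKey(_ + _)   // RDD[((String, String), Int)]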

With 1000 and 5000 partitions, countByKey ends up spilling too much. With 10000 partitions I started to get some results and at least some tasks completed, but after a while I got errors like the ones below:

    15/10/06 14:01:17 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 451457 ms exceeds timeout 120000 ms
    15/10/06 14:01:17 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 451457 ms
    15/10/06 14:01:17 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 2.0
    15/10/06 14:01:17 WARN TaskSetManager: Lost task 109.0 in stage 2.0 (TID 20111, localhost): ExecutorLostFailure (executor driver lost)
    15/10/06 14:01:17 ERROR TaskSetManager: Task 109 in stage 2.0 failed 1 times; aborting job
    15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 91), so marking it as still running
    15/10/06 14:01:17 WARN TaskSetManager: Lost task 34.0 in stage 2.0 (TID 20036, localhost): ExecutorLostFailure (executor driver lost)
    15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 118), so marking it as still running
    15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 100), so marking it as still running
    15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 76), so marking it as still running

...

    15/10/06 14:01:17 INFO TaskSchedulerImpl: Cancelling stage 2
    15/10/06 14:01:17 INFO DAGScheduler: ShuffleMapStage 2 (countByKey at ngram.scala:39) failed in 1020,915 s
    15/10/06 14:01:17 INFO DAGScheduler: Job 0 failed: countByKey at ngram.scala:39, took 3025,563964 s
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 109 in stage 2.0 failed 1 times, most recent failure: Lost task 109.0 in stage 2.0 (TID 20111, localhost): ExecutorLostFailure (executor driver lost)
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)