We are trying to run a Spark job (Spark 2.0, Hadoop 2.7.2), but for some reason we get a rather cryptic NPE on EMR. The same code runs fine as a plain Scala program, so we are not sure what is causing the problem. Here's the stack trace:
18:02:55,271 ERROR Utils:91 - Aborting task
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
As far as we can tell, this is thrown from the following method:
def process(dataFrame: DataFrame, S3bucket: String) = {
  dataFrame
    .map(row => "text|label")
    .coalesce(1)
    .write
    .mode(SaveMode.Overwrite)
    .text(S3bucket)
}
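For reference, here is a minimal self-contained version of what the failing method does (the SparkSession setup, toy data, and local output path below are placeholders we added for illustration; the real job reads our actual data and writes to S3):

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object NpeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("npe-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real input DataFrame.
    val df: DataFrame = Seq(("foo", 0), ("bar", 1)).toDF("text", "label")

    // Same shape as the failing method: map each Row to a String,
    // then write out a single text file.
    df.map(row => "text|label") // Dataset[String] via the implicit encoder
      .coalesce(1)
      .write
      .mode(SaveMode.Overwrite)
      .text("/tmp/npe-repro-output")

    spark.stop()
  }
}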
We narrowed it down to the map function, since this version works when submitted as a Spark job:
def process(dataFrame: DataFrame, S3bucket: String) = {
  dataFrame
    .coalesce(1)
    .write
    .mode(SaveMode.Overwrite)
    .text(S3bucket)
}
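One difference we noticed between the two versions: the map call turns the DataFrame into a Dataset[String], which in Spark 2.0 requires an implicit String encoder (normally pulled in via import spark.implicits._). We don't know whether that is related to the NPE, but for reference, here is the same method with the encoder spelled out explicitly (a sketch, not our production code):

import org.apache.spark.sql.{DataFrame, Encoders, SaveMode}

// Identical to the failing method, but with the String encoder
// passed explicitly instead of relying on spark.implicits._.
def process(dataFrame: DataFrame, S3bucket: String) = {
  dataFrame
    .map(row => "text|label")(Encoders.STRING)
    .coalesce(1)
    .write
    .mode(SaveMode.Overwrite)
    .text(S3bucket)
}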
Does anyone know what might be causing this problem, and how we can fix it? We are stumped.
scala hadoop distributed-computing bigdata apache-spark
cscan Aug 17 '16 at 1:22