NullPointerException in a Spark RDD map when submitted as a Spark job

We are trying to submit a Spark job (Spark 2.0, Hadoop 2.7.2), but for some reason we get a rather cryptic NPE on EMR. Everything works fine when run as a Scala program, so we are not sure what is causing the problem. Here's the stack trace:

18:02:55,271 ERROR Utils:91 - Aborting task
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

As far as we can tell, this is happening in the following method:

    def process(dataFrame: DataFrame, S3bucket: String) = {
      dataFrame.map(row =>
        "text|label"
      ).coalesce(1).write.mode(SaveMode.Overwrite).text(S3bucket)
    }

We've narrowed it down to the map function, since this version works when submitted as a Spark job:

    def process(dataFrame: DataFrame, S3bucket: String) = {
      dataFrame.coalesce(1).write.mode(SaveMode.Overwrite).text(S3bucket)
    }

Does anyone know what might be causing this problem, and how we can solve it? Any help would be much appreciated.

+4
scala hadoop distributed-computing bigdata apache-spark
Aug 17 '16 at 1:22
1 answer

I think the NullPointerException is thrown by a worker when it tries to access a SparkContext object that exists only on the driver and not on the workers.
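
Here's a minimal sketch of that failure mode (the class, field names, and column indices are made up for illustration): a @transient SparkContext is not serialized with the closure, so it comes back as null on the executors and any access to it throws an NPE, while a closure that only touches the row itself is fine.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Illustrative only: a driver-side reference captured by the map closure.
    class Job(@transient val sc: SparkContext) extends Serializable {

      def broken(dataFrame: DataFrame, s3Bucket: String): Unit = {
        import dataFrame.sparkSession.implicits._
        dataFrame.map { row =>
          // sc was not serialized (it's @transient), so it is null here -> NPE on the executor
          sc.appName + "|" + row.getString(0)
        }.write.mode(SaveMode.Overwrite).text(s3Bucket)
      }

      def fixed(dataFrame: DataFrame, s3Bucket: String): Unit = {
        import dataFrame.sparkSession.implicits._
        // The closure only uses the row's own fields, nothing driver-only.
        dataFrame.map(row => row.getString(0) + "|" + row.getString(1))
          .write.mode(SaveMode.Overwrite).text(s3Bucket)
      }
    }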

coalesce() repartitions your data. When you ask for a single partition, it will try to squeeze all of the data into one partition. This can put a lot of memory pressure on your application.

In general, it's a good idea not to shrink your partitions down to just 1. A sketch of that advice applied to the method from the question is shown below.
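
This is a sketch only; the row-to-string formatting is a placeholder for whatever "text|label" is supposed to be in your data. The idea is to let Spark write many part files in parallel, and merge them afterwards only if you really need a single file (for example with `hadoop fs -getmerge` or an S3-side copy step).

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def process(dataFrame: DataFrame, S3bucket: String) = {
      import dataFrame.sparkSession.implicits._
      dataFrame
        .map(row => row.getString(0) + "|" + row.getString(1))  // placeholder formatting
        .write
        .mode(SaveMode.Overwrite)
        .text(S3bucket)  // writes part-* files in parallel, no coalesce(1) bottleneck
    }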

For more information, see: Spark NullPointerException with saveAsTextFile, and this related question.




+5
Aug 17 '16 at 1:52


