Spark 1.6 kafka streaming on dataproc py4j error

I get the following error:

Py4JError(u'An error occurred while calling o73.createDirectStreamWithoutMessageHandler. Trace:\npy4j.Py4JException: Method createDirectStreamWithoutMessageHandler([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.HashMap, class java.util.HashSet, class java.util.HashMap]) does not exist\n\tat py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)\n\tat py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)\n\tat py4j.Gateway.invoke(Gateway.java:252)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:209)\n\tat java.lang.Thread.run(Thread.java:745)\n\n',)

I use spark-streaming-kafka-assembly_2.10-1.6.0.jar (which is present in the /usr/lib/hadoop/lib/ folder on all my nodes and the master).

(EDIT) Actual error: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.util.Apps.crossPlatformify(Ljava/lang/String;)Ljava/lang/String;

This was due to the wrong Hadoop version. Spark therefore has to be compiled against the correct Hadoop version:

mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 -DskipTests clean package

This will create a jar in the external/kafka-assembly/target folder.

1 answer

Using Dataproc image version 1, I successfully ran the pyspark streaming kafka wordcount example in two ways, against a separate Kafka broker "ad-kafka-inst" with a topic named "test".

  • With the unaltered wordcount example:

    $ gcloud dataproc jobs submit pyspark --cluster ad-kafka2 --properties spark.jars.packages=org.apache.spark:spark-streaming-kafka_2.10:1.6.0 ./kafka_wordcount.py ad-kafka-inst:2181 test 
    
  • With a custom build of the kafka assembly:

    • Download and unpack spark-1.6.0.tgz.
    • Build with:

      $ mvn -Phadoop-2.6 -Dhadoop.version=2.7.2 package
      
    • Upload spark-streaming-kafka-assembly_2.10-1.6.0.jar to GCS (e.g., gs://MYBUCKET).
    • Create the following initialization action on GCS (e.g., gs://MYBUCKET/install_spark_kafka.sh):

      #!/bin/bash
      gsutil cp gs://MYBUCKET/spark-streaming-kafka-assembly_2.10-1.6.0.jar /usr/lib/hadoop/lib/
      chmod 755 /usr/lib/hadoop/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar
      
    • Create the cluster with the initialization action:

      $ gcloud dataproc clusters create ad-kafka-init --initialization-actions gs://MYBUCKET/install_spark_kafka.sh
      
    • Submit the streaming job:

      $ gcloud dataproc jobs submit pyspark --cluster ad-kafka-init ./kafka_wordcount.py ad-kafka-inst:2181 test
      
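For context, kafka_wordcount.py (from the Spark examples) splits each micro-batch of Kafka messages into words and reduces them to (word, count) pairs. The per-batch transformation can be sketched in plain Python, with no Spark required; the function name `word_counts` is mine, for illustration:

```python
from collections import defaultdict

def word_counts(lines):
    """Mimic the DStream pipeline: flatMap(split) -> map((word, 1)) -> reduceByKey(+)."""
    counts = defaultdict(int)
    for line in lines:              # flatMap: each line yields its words
        for word in line.split(" "):
            counts[word] += 1       # reduceByKey: sum the per-word 1s
    return dict(counts)

# One micro-batch worth of Kafka messages:
print(word_counts(["hello kafka", "hello spark streaming"]))
# -> {'hello': 2, 'kafka': 1, 'spark': 1, 'streaming': 1}
```

In the real job, this logic runs per batch interval on the DStream returned by KafkaUtils, which is exactly the code path that fails when the kafka assembly jar is missing or built against the wrong Hadoop version.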
