I am writing a Spark job in Python, and I need to read a large number of Avro files.
The closest solution I found is in the Spark examples folder. However, that script has to be submitted with spark-submit, where you can pass a driver classpath option so that the jar containing the AvroKey/AvroValue converter classes is visible:
avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)
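For reference, this is roughly how that script is normally launched. This is a sketch only: the jar path, script name, and input path are placeholders, not values from my cluster.

spark-submit \
    --jars /path/to/spark-examples_2.10-1.0.0-cdh5.1.0.jar \
    --driver-class-path /path/to/spark-examples_2.10-1.0.0-cdh5.1.0.jar \
    my_avro_reader.py /path/to/input.avro

The --jars flag ships the jar (which contains AvroWrapperToJavaConverter) to the executors, while --driver-class-path puts it on the driver's classpath.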
In my case, I need to run everything from within a single Python script. I tried setting an environment variable pointing at the jar file, hoping that PySpark would add the jar to the classpath, but apparently it does not: I get a class-not-found error.
os.environ['SPARK_SUBMIT_CLASSPATH'] = "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0.jar"
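One variant I have seen suggested (untested here; the jar path is a placeholder, and the environment variable and config keys are assumptions about how Spark picks up jars) is to set the classpath before the SparkContext is created, either via SPARK_CLASSPATH or via SparkConf:

import os

# Hypothetical jar location; substitute the spark-examples jar from
# your own CDH install.
EXAMPLES_JAR = ("/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53"
                "/lib/spark/examples/lib/"
                "spark-examples_2.10-1.0.0-cdh5.1.0.jar")

# This must happen BEFORE the SparkContext is constructed; once the
# JVM is up, further classpath changes are ignored.
os.environ["SPARK_CLASSPATH"] = EXAMPLES_JAR

# With pyspark installed, the context would then be built like this:
# from pyspark import SparkConf, SparkContext
# conf = (SparkConf()
#         .set("spark.jars", EXAMPLES_JAR)
#         .set("spark.driver.extraClassPath", EXAMPLES_JAR))
# sc = SparkContext(conf=conf)

But I am not sure whether this is the right mechanism either.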
Can someone show me how to read Avro files from within a single Python script?
python apache-spark pyspark avro
B.Mr.W.