I'm having trouble reading an ORC file directly from the Spark shell. Note: I'm running Hadoop 1.2 and Spark 1.2 and using the pyspark shell, but I can use spark-shell (which runs Scala) instead.
I used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])
I get an error, usually one complaining about wrong syntax. Once the code seemed to work, when I passed only one of the three arguments to hadoopFile, but when I tried to run
inputRead.first()
the output was RDD[nothing, nothing]. I don't know whether this is because the variable inputRead was not created as an RDD or was not created at all.
I appreciate any help!