Reading ORC Files Directly from Spark Shell

Question

I'm having trouble reading an ORC file directly from the Spark shell. Note: I'm running Hadoop 1.2 and Spark 1.2 and using the pyspark shell, but I can also use spark-shell (which runs Scala).

I used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])

I get an error, usually one about wrong syntax. Once, the code seemed to work when I passed only one of the three arguments to hadoopFile, but when I tried to use

inputRead.first()

the output was RDD[nothing, nothing]. I don't know whether this is because the variable inputRead was not created as an RDD or was not created at all.

I appreciate any help!

scala hadoop hive apache-spark pyspark

mslick3 11 . '15 22:27

Sudheer Palyam · Answer 1 · 2016-05-20T08:40:42+0000

In Spark 1.5, an ORC file can be loaded directly:

val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
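
Since the question mentions the pyspark shell, here is a rough Python counterpart of the snippet above. It is only a sketch: load_orc is a hypothetical helper name, and it assumes a Spark version that has the DataFrame reader API (the answer above uses 1.5) and a context with Hive support, which Spark 1.x appears to need for ORC.

```python
def load_orc(sql_context, path):
    """Load an ORC file or directory into a DataFrame.

    Assumptions (not from the original post):
      - sql_context is a HiveContext, e.g. HiveContext(sc) in the pyspark shell
      - path is a URI string such as "hdfs:///ORC_FILE_PATH"
    """
    # Same chain as the Scala answer: pick the "orc" source, then load the path.
    return sql_context.read.format("orc").load(path)
```

In the pyspark shell, where sc already exists, this could be used as load_orc(HiveContext(sc), "hdfs:///ORC_FILE_PATH").show() after from pyspark.sql import HiveContext.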

UserszrKs · Answer 2 · 2017-02-14T06:24:59+0000

val df = sqlContext.read.format("orc").load(
  "hdfs://localhost:8020/user/aks/input1/*",
  "hdfs://localhost:8020/aks/input2/*/part-r-*.orc")
