How to write a DataFrame (obtained from a Hive table) to a Hadoop SequenceFile and RCFile?

I can write it in

  • ORC
  • PARQUET

and

  • TEXTFILE
  • AVRO

using these additional Databricks dependencies:

  <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.10</artifactId>
      <version>1.5.0</version>
  </dependency>
  <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-avro_2.10</artifactId>
      <version>2.0.1</version>
  </dependency>

Code example:

  SparkContext sc = new SparkContext(conf);
  HiveContext hc = new HiveContext(sc);
  DataFrame df = hc.table(hiveTableName);
  df.printSchema();
  DataFrameWriter writer = df.repartition(1).write();

  if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
      writer.orc(outputHdfsFile);
  } else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
      writer.parquet(outputHdfsFile);
  } else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
      writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
  } else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
      writer.format("com.databricks.spark.avro").save(outputHdfsFile);
  }

Is there a way to write a DataFrame to a Hadoop SequenceFile and RCFile?

1 answer

You can use saveAsObjectFile(String path) to save the RDD as a SequenceFile of serialized objects. So in your case, you need to extract the underlying RDD from the DataFrame:

  JavaRDD<Row> rdd = df.javaRDD();
  rdd.saveAsObjectFile(outputHdfsFile);
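
One thing to keep in mind: saveAsObjectFile writes Java-serialized objects, so the resulting SequenceFile is mainly useful for reading back with objectFile(). If you need a plain key/value SequenceFile instead, a minimal sketch (not from the answer above; the NullWritable key and comma separator are illustrative choices) is to map each Row to a Writable pair and write through the Hadoop output format directly:

  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.function.PairFunction;
  import org.apache.spark.sql.Row;
  import scala.Tuple2;

  // df is the DataFrame from the question above.
  JavaPairRDD<NullWritable, Text> pairs = df.javaRDD().mapToPair(
      new PairFunction<Row, NullWritable, Text>() {
          @Override
          public Tuple2<NullWritable, Text> call(Row row) {
              // Flatten the row into a delimited string; adjust the separator as needed.
              return new Tuple2<>(NullWritable.get(), new Text(row.mkString(",")));
          }
      });

  // Writes a classic key/value SequenceFile via the Hadoop OutputFormat.
  pairs.saveAsHadoopFile(outputHdfsFile,
      NullWritable.class, Text.class, SequenceFileOutputFormat.class);

For RCFile, as far as I know DataFrameWriter has no built-in format, so one workaround is to let Hive do the serialization by writing into a table stored as RCFILE (the table name rcfile_table below is just a placeholder; hc is the HiveContext from the question):

  // Register the DataFrame and create an RCFile-backed table from it.
  df.registerTempTable("df_tmp");
  hc.sql("CREATE TABLE rcfile_table STORED AS RCFILE AS SELECT * FROM df_tmp");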
