How to write a DataFrame (obtained from a Hive table) to a Hadoop SequenceFile and RCFile?

I can write it in

  • ORC
  • PARQUET

and

  • TEXTFILE
  • AVRO

using these additional Databricks dependencies:

  <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.10</artifactId>
      <version>1.5.0</version>
  </dependency>
  <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-avro_2.10</artifactId>
      <version>2.0.1</version>
  </dependency>

Code example:

  SparkContext sc = new SparkContext(conf);
  HiveContext hc = new HiveContext(sc);
  DataFrame df = hc.table(hiveTableName);
  df.printSchema();
  DataFrameWriter writer = df.repartition(1).write();

  if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
      writer.orc(outputHdfsFile);
  } else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
      writer.parquet(outputHdfsFile);
  } else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
      writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
  } else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
      writer.format("com.databricks.spark.avro").save(outputHdfsFile);
  }

Is there a way to write a DataFrame to a Hadoop SequenceFile and RCFile?

1 answer

You can use saveAsObjectFile(String path) to save the RDD as a SequenceFile of serialized objects. So in your case, you need to extract the underlying RDD from the DataFrame:

  JavaRDD<Row> rdd = df.javaRDD();
  rdd.saveAsObjectFile(outputHdfsFile);
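
One thing to keep in mind: saveAsObjectFile writes Java-serialized objects, so the resulting SequenceFile is mainly useful for reading back with objectFile(). If you need a plain key/value SequenceFile instead, a minimal sketch (not from the answer above; the NullWritable key and comma separator are illustrative choices) is to map each Row to a Writable pair and write through the Hadoop output format directly:

  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.function.PairFunction;
  import org.apache.spark.sql.Row;
  import scala.Tuple2;

  // df is the DataFrame from the question above.
  JavaPairRDD<NullWritable, Text> pairs = df.javaRDD().mapToPair(
      new PairFunction<Row, NullWritable, Text>() {
          @Override
          public Tuple2<NullWritable, Text> call(Row row) {
              // Flatten the row into a delimited string; adjust the separator as needed.
              return new Tuple2<>(NullWritable.get(), new Text(row.mkString(",")));
          }
      });

  // Writes a classic key/value SequenceFile via the Hadoop OutputFormat.
  pairs.saveAsHadoopFile(outputHdfsFile,
      NullWritable.class, Text.class, SequenceFileOutputFormat.class);

For RCFile, as far as I know DataFrameWriter has no built-in format, so one workaround is to let Hive do the serialization by writing into a table stored as RCFILE (the table name rcfile_table below is just a placeholder; hc is the HiveContext from the question):

  // Register the DataFrame and create an RCFile-backed table from it.
  df.registerTempTable("df_tmp");
  hc.sql("CREATE TABLE rcfile_table STORED AS RCFILE AS SELECT * FROM df_tmp");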
