Write data that can be read by Elephant Bird's ProtobufPigLoader

For a side project of mine, I want to analyze about 2 TB of Protobuf objects. I want to consume these objects in a Pig script through the elephant-bird library. However, it's not entirely clear to me how a file has to be written to HDFS so that it can be read by the ProtobufPigLoader class.

This is what I have:

Pig script:

  register ../fs-c/lib/*.jar -- this includes the elephant bird library
  register ../fs-c/*.jar    
  raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');

Import tool (parts thereof):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter

def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true) // overwrite if the file already exists
  new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
}

val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()

At first, ProtobufPigLoader failed because hadoop-lzo was not installed (fixed now, not part of the question). With that resolved, ProtobufPigLoader loads, but `DUMP raw_data;` fails with "Unable to open iterator for alias raw_data" and `ILLUSTRATE raw_data;` reports "No (valid) input data found!".

To me it looks like data written by ProtobufBlockWriter cannot be read by ProtobufPigLoader. What should I use instead? How can I write a file to HDFS with an external tool so that ProtobufPigLoader can read it?

Alternative question: what should I use instead? How can I write rather large objects to Hadoop so that they can be consumed with Pig? The objects are not very complex, but they contain large lists of sub-objects (repeated fields in Protobuf).
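For context, a main object carrying large repeated lists of sub-objects might be declared like this in a .proto file. This is a hypothetical sketch only; the message and field names are illustrative guesses, not the actual PigProtocol.File definition:

```protobuf
// Hypothetical sketch -- the real de.pc2.dedup.fschunk.pig schema is not shown in the question.
message File {
  optional string filename = 1;
  repeated Chunk chunks = 2;       // large list of sub-objects (repeated field)
}

message Chunk {
  optional bytes fingerprint = 1;  // byte string; a text format would need Base64 here
  optional int64 size = 2;
}
```

A shape like this keeps the sub-objects nested under the main object, which is what makes a compact binary container attractive compared with flattening the data.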

  • I want to avoid text formats such as JSON because they are simply too large for my data. I expect they would bloat it by a factor of 2 to 3 (lots of integers, and lots of byte strings that would need to be encoded as Base64).
  • I want to avoid normalizing the data so that the id of the main object is attached to each sub-object (which is what is done now), because that also blows up the space consumption and makes joins necessary in later processing.
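The Base64 bloat mentioned above is easy to quantify: Base64 encodes every 3 input bytes as 4 ASCII characters, a fixed ~33% overhead before JSON field names and quoting add their own. A minimal Java illustration (the class name is arbitrary):

```java
import java.util.Base64;

public class Base64Overhead {
    public static void main(String[] args) {
        byte[] raw = new byte[3000];                        // 3000 bytes of binary payload
        String encoded = Base64.getEncoder().encodeToString(raw);
        System.out.println(raw.length);                     // 3000
        System.out.println(encoded.length);                 // 4000: 4 chars per 3 bytes
    }
}
```

That 4/3 factor applies to every byte string on its own, so an overall blow-up of 2 to 3 once JSON structure is added is plausible.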

Update:

  • the protobuf classes are registered,
  • the protobuf class is found and a DESCRIBE on the alias works.
