For a project of mine, I want to analyze about 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the elephant-bird library. However, it is not entirely clear to me how to write a file to HDFS so that it can be read by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar -- this includes the elephant bird library
register ../fs-c/*.jar
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
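For reference, once the load succeeds, the protobuf fields should be addressable by name in later Pig statements. The field names below (`filename`, `type`) are hypothetical members of the `File` message, used only to illustrate the shape of the pipeline:

```pig
-- project two (hypothetical) fields of the File message
file_names = FOREACH raw_data GENERATE filename, type;
DUMP file_names;
```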
Import tool (parts thereof):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter

def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  // create(..., true) overwrites an existing file
  val os = fs.create(filenamePath, true)
  new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](
    os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
}
val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
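If it turns out that ProtobufPigLoader expects LZO-compressed block files (elephant-bird's loaders are built on top of hadoop-lzo), one thing to try is wrapping the HDFS output stream in an LzopCodec before handing it to ProtobufBlockWriter. This is only a sketch under that assumption; `LzopCodec` comes from the hadoop-lzo library, and the file name would presumably need an `.lzo` extension for the loader to recognize it:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.hadoop.compression.lzo.LzopCodec
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter
import de.pc2.dedup.fschunk.pig.PigProtocol

def getLzoWriter(filenamePath: Path): ProtobufBlockWriter[PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  // assumption: file name ends in ".lzo" so the loader detects the format
  val rawStream = fs.create(filenamePath, true)
  val codec = new LzopCodec()
  codec.setConf(conf)
  // wrap the raw HDFS stream in an LZOP-compressed stream
  val lzoStream = codec.createOutputStream(rawStream)
  new ProtobufBlockWriter[PigProtocol.File](lzoStream, classOf[PigProtocol.File])
}
```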
A remark on ProtobufPigLoader and hadoop-lzo: the loader does appear to pick up the files, but DUMP raw_data; fails with Unable to open iterator for alias raw_data, and ILLUSTRATE raw_data; reports No (valid) input data found!.
To me this looks like an incompatible combination of ProtobufBlockWriter and ProtobufPigLoader. Do I need a different writer? What is a robust way to write the data to HDFS so that it can be consumed by ProtobufPigLoader?
Alternative question: what is a good way to write a large number of objects to Hadoop so they can be consumed from Pig? The objects are not very complex, but they contain large lists in sub-objects (repeated fields in Protobuf).
- I want to avoid text formats and JSON, because they are simply too large for my data. I expect they would bloat it by a factor of 2 or 3 (lots of integers, and lots of byte strings that would need Base64 encoding).
- I want to avoid denormalizing the data so that the id of the top-level object is attached to every sub-object (which is what is done now), because that also blows up the space consumption and makes joins necessary in later processing.
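For context, the data has roughly the following shape. This is a hypothetical sketch of the schema behind PigProtocol.File, not the real .proto, written only to illustrate the repeated sub-objects and byte strings mentioned above:

```proto
// hypothetical sketch of the schema (proto2 syntax)
message Chunk {
  optional bytes fingerprint = 1; // byte string; would need Base64 in JSON
  optional int64 size = 2;
}

message File {
  optional string filename = 1;
  optional string type = 2;
  repeated Chunk chunks = 3;      // the large embedded list
}
```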
Updates:
- I do not use generated protobuf loader classes; I use the reflection-based loader.
- The protobuf classes are registered, and DESCRIBE correctly shows the types.