Load a protobuf-format file in a Pig script using a LoadFunc UDF

I have very little knowledge of Pig. I have a data file in protobuf format and need to load it in a Pig script, so I need to write a LoadFunc UDF, say Protobufloader(), to load it.

My Pig script will be:

 A = LOAD 'abc_protobuf.dat' USING Protobufloader() as (name, phonenumber, email); 

All I want to know is how to get hold of the file's input stream. Once I have the input stream, I can parse the protobuf data into Pig tuples myself.

PS: thanks in advance

1 answer

Twitter's open-source library Elephant Bird has many such loaders: https://github.com/kevinweil/elephant-bird

You can use LzoProtobufB64LinePigLoader and LzoProtobufBlockPigLoader. https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load

To use it, you just need to do:

 define ProtoLoader com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('your.proto.class.name');
 a = load '/your/file' using ProtoLoader;
 b = foreach a generate field1, field2;

After loading, the records are automatically translated into Pig tuples with the proper schema.

However, these loaders assume that your data is written as serialized protocol buffers and compressed with LZO.

They have corresponding writers as well, in the package com.twitter.elephantbird.pig.store. If your data format is slightly different, you can adapt their code into your own custom loader.
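
On the "how do I get the input stream" part: a Pig LoadFunc is not handed a raw stream. It names a Hadoop InputFormat in getInputFormat(), receives that format's RecordReader in prepareToRead(), and returns one tuple per call to getNext(). So for a custom format, the stream handling usually lives in your own RecordReader. As a rough sketch of that piece only, and assuming (the question does not say) that your file is uncompressed and each message is prefixed with its varint-encoded length, the way protobuf's writeDelimitedTo writes it, the record splitting could look something like this. DelimitedRecordSplitter is a hypothetical name and it uses only the JDK:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a stream of varint-length-prefixed records into byte arrays. */
final class DelimitedRecordSplitter {

    /** Reads a base-128 varint length; returns -1 only on a clean end of stream. */
    static int readVarint32(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return -1; // no more records
        }
        int result = b & 0x7f;
        int shift = 7;
        // A set high bit means another length byte follows.
        while ((b & 0x80) != 0) {
            if (shift >= 32) {
                throw new IOException("malformed varint length");
            }
            b = in.read();
            if (b == -1) {
                throw new EOFException("stream ended inside a varint length");
            }
            result |= (b & 0x7f) << shift;
            shift += 7;
        }
        if (result < 0) {
            throw new IOException("record length out of range");
        }
        return result;
    }

    /** Returns the next record's raw bytes, or null at end of stream. */
    static byte[] readRecord(InputStream in) throws IOException {
        int length = readVarint32(in);
        if (length == -1) {
            return null;
        }
        byte[] payload = new byte[length];
        // readFully throws EOFException if the record is truncated.
        new DataInputStream(in).readFully(payload);
        return payload;
    }

    /** Reads every record until end of stream. */
    static List<byte[]> readAll(InputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        byte[] record;
        while ((record = readRecord(in)) != null) {
            records.add(record);
        }
        return records;
    }
}
```

Each byte[] returned by readRecord would then go to your generated message class's parseFrom(byte[]), and the resulting fields into a Pig tuple. If your file uses a different framing, this part has to change accordingly.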
