Twitter has open-sourced this. Their elephant-bird project has many such loaders: https://github.com/kevinweil/elephant-bird
You can use LzoProtobufB64LinePigLoader and LzoProtobufBlockPigLoader. https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/load
To use it, you just need to do:
    define ProtoLoader com.twitter.elephantbird.pig.load.LzoProtobufB64LineLoader('your.proto.class.name');
    a = load '/your/file' using ProtoLoader;
    b = foreach a generate field1, field2;
Once loaded, the data is automatically converted to proper Pig tuples with the correct schema.
However, these loaders assume that your data is written as serialized protobufs and compressed with LZO.
They have corresponding writers as well, in the package com.twitter.elephantbird.pig.store. If your data format is slightly different, you can adapt their code into your own custom loader.
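If you do end up adapting the code, it helps to see what the line-based format looks like. Going by the loader's name, the B64Line variant appears to store one base64-encoded serialized protobuf per line. The sketch below is a rough illustration in Python, not elephant-bird's actual code: the function names are made up, the byte strings stand in for real serialized protobuf messages, and the LZO compression step is left out.

```python
import base64


def encode_b64_lines(messages):
    """Write each serialized message as one base64 line (hypothetical helper)."""
    # Base64 output never contains a newline, so '\n' is a safe record separator
    # even though raw protobuf bytes may contain it.
    return b"".join(base64.b64encode(m) + b"\n" for m in messages)


def decode_b64_lines(data):
    """Split on line boundaries and decode each line back to raw bytes."""
    return [base64.b64decode(line) for line in data.splitlines()]


# Stand-ins for serialized protobuf messages; with real protobufs these
# would come from something like msg.SerializeToString().
records = [b"\x08\x01\x12\x03abc", b"\x08\x02\x12\x03def"]

encoded = encode_b64_lines(records)
assert decode_b64_lines(encoded) == records
```

A custom loader for a slightly different format would mostly change the decoding step; the schema handling comes from the protobuf class you pass in.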