Another way to use Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or resorting to streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script where the data-processing pipeline is defined. Jython is used as the Python interpreter, and Cascading provides the MapReduce framework for the stream operations. Unions, groupings, and so on work similarly to Pig in spirit, so there are no surprises if you already know Pig.
A word-count example looks like this:
from pycascading.helpers import *  # pulls in Flow, Hfs, TextLine, the map decorator, etc.

@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()
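If the dataflow style is unfamiliar, here is a minimal plain-Python sketch of what the same pipeline computes; it has nothing to do with the PyCascading API and just mirrors the split → group → count stages on a local file:

from collections import Counter

def word_count(lines):
    # split_words stage: emit one record per whitespace-separated word
    words = (word for line in lines for word in line.split())
    # GroupBy('word') | Count() stage: count records per distinct word
    return Counter(words)

if __name__ == '__main__':
    with open('input.txt') as f:
        for word, count in sorted(word_count(f).items()):
            print('%s\t%d' % (word, count))  # tab-separated, like the TSV sink

The difference, of course, is that PyCascading compiles the pipeline into Cascading MapReduce jobs that run on the cluster, while this sketch runs in a single local process.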