Using Pig and Python

Sorry if this question is poorly worded: I'm embarking on a large-scale machine learning project, and I don't like Java programming. I like to write programs in Python. I heard well about the Pig. I was wondering if anyone could explain to me how a useful Pig is combined with Python for mathematically related work. Also, if I write "python streaming code", does Jython appear in the picture? Is it more effective if it gets into the picture?

thanks

PS: For several reasons, I would not want to use the Mahout code as is. Perhaps I want to use some of my data structures: it would be useful to know if this can be done.

+4
source share
3 answers

Another use case for Python with Hadoop is PyCascading . Instead of writing only UDF in Python / Jython or using streaming, you can shift the whole work together in Python using Python functions as "UDF" in the same script as where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce structure for stream operations is Cascading . Unions, groupings, etc. They work similarly to the Pig in spirit, so there is no surprise if you already know the Pig.

An example of word counting is as follows:

@map(produces=['word']) def split_words(tuple): # This is called for each line of text for word in tuple.get(1).split(): yield [word] def main(): flow = Flow() input = flow.source(Hfs(TextLine(), 'input.txt')) output = flow.tsv_sink('output') # This is the processing pipeline input | split_words | GroupBy('word') | Count() | output flow.run() 
+4
source

When you use streaming in pig , it doesn't matter which language you use ... all it does is execute a command in a shell (for example, via bash). You can use Python, just like you can use grep or a C program.

Now you can define Pig UDF in Python natively . These UDFs will be called through Jython when they are executed.

+2
source

The Programming Pig reference discusses the use of UDF. The book is required in general. In a recent project, we used Python UDF and sometimes had problems with Floats vs. inconsistencies Doubles, so be warned. My impression is that Python UDF support may not be as reliable as Java UDF support, but overall it works very well.

0
source

All Articles