Hadoop Streaming job error in Python

I have a MapReduce task written in Python. The program was tested successfully in a Linux environment, but it fails when I run it under Hadoop.

Here is the job command:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1+169.127-streaming.jar \
    -input /data/omni/20110115/exp6-10122 \
    -output /home/yan/visitorpy.out \
    -mapper SessionMap.py \
    -reducer SessionRed.py \
    -file SessionMap.py \
    -file SessionRed.py

The *.py files have mode 755, and #!/usr/bin/env python is the top line in each *.py file. Mapper.py:

    #!/usr/bin/env python
    import sys
    for line in sys.stdin:
        val = line.split("\t")
        (visidH, visidL, sessionID) = (val[4], val[5], val[108])
        print "%s%s\t%s" % (visidH, visidL, sessionID)

Error from the log:

    java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
+4
8 answers

I had the same problem, and it puzzled me: when I tested my mapper and reducer on test data, they ran fine, but running the same test set through Hadoop MapReduce produced the same error.

How to check your code locally:

 cat <test file> | python mapper.py | sort | python reducer.py 

On further investigation, I found that the "shebang line" was missing from my mapper.py script.

 #!/usr/bin/python 

Add the above line as the first line of your Python script, and leave an empty line after it.

If you need to know more about the "shebang line", read Why do people write #!/usr/bin/env python on the first line of a Python script?

+4

You can find the Python error messages (e.g. the traceback) and anything else your script writes to stderr in the Hadoop web interface. It is a little hidden, but you will find it via the link the streaming job prints: click on "Map" or "Reduce", then click on any task, and then in the "Task Logs" column click "All".

+2

Finally, I fixed the error, and here are the lessons I learned:

1) The original code had no error handling for bad data, so I did not notice the problem when I tested the code on a small data set.

2) For handling empty fields/variables, I found that checking for None and empty strings in Python is a little tricky. Personally, I like len(strVar), which is easy to read and effective.

3) The hadoop command is correct in this case. Surprisingly, *.py files with mode 644 can be launched successfully in the environment I am using.
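
For illustration, here is a minimal sketch of that kind of defensive parsing, reusing the field indexes from the question's mapper; the exact checks are my assumption, not the poster's code:

    #!/usr/bin/env python
    # Illustrative sketch of defensive parsing, not the poster's exact code.
    import sys

    for line in sys.stdin:
        val = line.rstrip("\n").split("\t")
        if len(val) < 109:
            continue                      # bad record: too few fields to reach val[108]
        (visidH, visidL, sessionID) = (val[4], val[5], val[108])
        if len(sessionID) == 0:
            continue                      # empty-field check via len(), as described above
        print "%s%s\t%s" % (visidH, visidL, sessionID)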

+1

Hadoop Streaming - Hadoop 1.0.x

I had the same "Broken pipe" problem. In my gearbox, the problem was the "break" statement. So, everything went well until the "break". After that, the working gearbox stopped working with the Broken Pipe error. In addition, another gearbox began to work with the same fate with the previous one. This circle went on and on.

If I understood it correctly, once a reducer starts reading from stdin (as mine did, in a for loop), it has to read everything. You cannot "break" out of this operation, even if you close stdin (os.close(0), as I tried).
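
As a hedged illustration of the work-around, the sketch below drains stdin instead of breaking out early; the 10-line cutoff is an invented example condition, not the poster's logic:

    #!/usr/bin/env python
    # Sketch: emit only the first 10 lines (an invented cutoff), but keep
    # reading stdin to EOF so the upstream writer never hits a broken pipe.
    import sys

    emitted = 0
    for line in sys.stdin:
        if emitted < 10:
            sys.stdout.write(line)
            emitted += 1
        # no break here: drain the rest of stdin instead of stopping early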

+1

I had the same problem today with Hadoop 1.0.1. Fortunately, I solved it with:

hadoop ... -mapper $cwd/mapper.py -reducer $cwd/reducer.py ...

(My Python scripts were in the current directory.) It seems that absolute paths are now needed.

Best!

+1

Dirty input can cause this problem.

Try using a try/except block to avoid it:

    #!/usr/bin/env python
    import sys
    for line in sys.stdin:
        try:
            val = line.split("\t")
            (visidH, visidL, sessionID) = (val[4], val[5], val[108])
            print "%s%s\t%s" % (visidH, visidL, sessionID)
        except Exception as e:
            pass
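If you would rather not swallow bad records silently, a possible refinement (my sketch, not part of the answer above) is to count them through Hadoop Streaming's stderr counter protocol:

    #!/usr/bin/env python
    # Sketch: count bad records as a job counter instead of dropping them
    # silently. The stderr line follows Hadoop Streaming's
    # reporter:counter:<group>,<counter>,<amount> protocol; the names
    # "SessionMap"/"BadRecords" are illustrative.
    import sys

    for line in sys.stdin:
        try:
            val = line.split("\t")
            (visidH, visidL, sessionID) = (val[4], val[5], val[108])
            print "%s%s\t%s" % (visidH, visidL, sessionID)
        except Exception:
            sys.stderr.write("reporter:counter:SessionMap,BadRecords,1\n")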
+1

Python + Hadoop is tricky in some details that should not be tricky. Take a look here.

Try enclosing your input path in double quotation marks: -input "/data/omni/20110115/exp6-10122"

0

One possible solution is to include "python" explicitly, that is:

 -mapper "python ./mapper.py" -reducer "python ./reducer.py" 
0
