Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) that works fine on smaller runs (for example, 10 machines handling 200 inputs). However, when I run it against a large data set (12 machines processing a total of 6000 inputs, at roughly 20 seconds per input), after 2.5 hours of crunching I get the following error:

 java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
     at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
     at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
     at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
     at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)

If I read this correctly, the subprocess failed with code 143 because something sent a SIGTERM signal to the streaming job: exit code 143 is 128 + 15, and 15 is the signal number of SIGTERM.
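A quick way to confirm that arithmetic in Python, the language the job itself runs in:

 import signal

 # A child killed by a signal is conventionally reported as
 # 128 + the signal number, so SIGTERM (15) shows up as 143.
 print(int(signal.SIGTERM))        # 15
 print(128 + int(signal.SIGTERM))  # 143 -- the "failed with code 143" above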

Do I understand correctly? If so: when will the EMR infrastructure send SIGTERM?

+6
2 answers

I figured out what was happening, so here is some information in case anyone else runs into a similar problem.

The key for me was to look at the jobtracker logs. They live in your logs folder on S3, under:

 <logs folder>/daemons/<id of node running jobtracker>/hadoop-hadoop-jobtracker-XXX.log. 

There were several lines of the following form:

 2012-08-21 08:07:13,830 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 29 on 9001): Error from attempt_201208210612_0001_m_000015_0: Task attempt_201208210612_0001_m_000015_0 failed to report status for 601 seconds. Killing! 

So my code was stalling and being killed: it exceeded the 10-minute task timeout. For those 10 minutes I was doing no I/O at all, which was certainly not expected (I would typically do I/O every 20 seconds).
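For anyone hitting the same timeout with work that is legitimately slow, Hadoop streaming counts stderr lines of the form reporter:status:<message> as progress reports. Here is a minimal heartbeat sketch for a Python mapper; process_record is a hypothetical stand-in for the real per-input work:

 #!/usr/bin/env python
 import sys
 import threading
 import time

 def heartbeat(interval=60):
     # Each "reporter:status:" line on stderr resets the task's
     # progress timer, keeping a slow mapper under the 600s timeout.
     while True:
         sys.stderr.write("reporter:status:still working\n")
         sys.stderr.flush()
         time.sleep(interval)

 t = threading.Thread(target=heartbeat)
 t.daemon = True  # don't keep the process alive after the mapper finishes
 t.start()

 def process_record(line):
     return line.strip()  # placeholder for the real ~20s-per-input work

 for line in sys.stdin:
     sys.stdout.write(process_record(line) + "\n")

You could instead just raise mapred.task.timeout, but a heartbeat keeps genuinely hung tasks detectable.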

Then I discovered this article:

http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code

"In one of our science projects, we have a number of Hadoop Streaming jobs that run over ruby and rely on libxml to parse documents. This creates a perfect storm of badness - the web is full of really bad html and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults."

That nailed it. I must have been hitting one of these "libxml going into an infinite loop" situations (I use libxml heavily, though with Python rather than Ruby).

The final step for me was to enable skip mode (instructions here: Setting hadoop parameters with boto?).
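In case it saves someone a search, here is a rough sketch of what that looks like with boto, the library from the linked question. Treat it as a starting point, not a drop-in recipe: the bucket paths are placeholders, and the skip-mode property names are the Hadoop 0.20-era ones, so check them against your cluster's version.

 from boto.emr.connection import EmrConnection
 from boto.emr.step import StreamingStep

 # step_args are appended to the streaming jar invocation; older
 # streaming jars take -jobconf key=value (newer ones prefer -D).
 step = StreamingStep(
     name='streaming job with skip mode',
     mapper='s3n://mybucket/mapper.py',      # placeholder paths
     reducer='s3n://mybucket/reducer.py',
     input='s3n://mybucket/input/',
     output='s3n://mybucket/output/',
     step_args=[
         '-jobconf', 'mapred.skip.map.max.skip.records=1',
         '-jobconf', 'mapred.skip.attempts.to.start.skipping=2',
         '-jobconf', 'mapred.map.max.attempts=8',
     ])

 conn = EmrConnection()  # credentials come from the usual boto config
 conn.run_jobflow(name='skip-mode job flow', steps=[step],
                  num_instances=12, log_uri='s3n://mybucket/logs/')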

+10

I came across this same exit from Amazon EMR ("subprocess failed with code 143"). My streaming job used PHP curl to send data to a server whose security group did not include the MapReduce job servers. So the reducer timed out and was killed. Ideally I would have added my job machines to the same security group, but instead I settled for adding a URL security token in front of my API.
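The general lesson is that an outbound network call with no timeout hangs silently until Hadoop kills the task. This is not the answerer's PHP, just the same idea sketched in Python; the function name and URL are hypothetical:

 import urllib2  # Python 2, matching the era of this job

 def post_with_timeout(url, data, timeout=30):
     # Raises urllib2.URLError after `timeout` seconds, which shows up
     # in the task logs instead of an opaque "failed with code 143".
     return urllib2.urlopen(url, data, timeout).read()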

+1
