I figured out what was going on, so I am leaving the details here in case someone else runs into similar problems.
The key for me was to look at the jobtracker logs. They live in your task's logs folder on S3, under:
<logs folder>/daemons/<id of node running jobtracker>/hadoop-hadoop-jobtracker-XXX.log.
There were several lines of the following form:
2012-08-21 08:07:13,830 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 29 on 9001): Error from attempt_201208210612_0001_m_000015_0: Task attempt_201208210612_0001_m_000015_0 failed to report status for 601 seconds. Killing!
So my code was hanging somewhere and the task was killed: it went past the 10-minute task timeout. For those 10 minutes it did no I/O at all, which was certainly not expected (it would normally do I/O about every 20 seconds).
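As an aside: if a streaming task is legitimately busy for more than 10 minutes, Hadoop Streaming lets you reset that timeout by writing lines of the form reporter:status:<message> to stderr. A minimal sketch of a heartbeat thread (the function name is just for illustration):

    import sys
    import threading
    import time

    def start_status_heartbeat(message="still working", interval=60):
        """Periodically emit a Streaming status line so the task is not killed.

        Hadoop Streaming treats stderr lines of the form
        'reporter:status:<message>' as status updates, which resets the
        task timeout (mapred.task.timeout, 600 seconds by default).
        """
        def beat():
            while True:
                sys.stderr.write("reporter:status:%s\n" % message)
                sys.stderr.flush()
                time.sleep(interval)

        t = threading.Thread(target=beat)
        t.daemon = True  # do not keep the mapper process alive on exit
        t.start()
        return t

    # In the mapper, before the long-running work:
    # start_status_heartbeat()

In my case, though, that would only have masked the real problem, because the task was not busy, it was stuck.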
Then I discovered this article:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
"In one of our science projects, we have a few Hadoop Streaming jobs that run over ruby and rely on libxml to parse documents. This creates a perfect storm of badness: the web is full of really bad html, and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults."
That nailed it. I must be hitting one of those "libxml goes into an infinite loop" situations (I use libxml heavily, just from Python rather than Ruby).
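One defensive workaround for a parser that can hang or crash on a single bad document is to do each parse in a short-lived child process with a hard timeout, so the mapper itself survives. The sketch below is only illustrative and assumes lxml; the helper names are made up:

    import multiprocessing
    import queue  # for queue.Empty

    def _parse_worker(raw_html, out_queue):
        # Runs in a child process, so a libxml hang or segfault cannot
        # take down the mapper itself.
        from lxml import html  # assumes lxml; adjust to however you drive libxml
        doc = html.fromstring(raw_html)
        out_queue.put(doc.text_content())  # or whatever you actually extract

    def parse_with_timeout(raw_html, timeout=30):
        """Parse one document in a subprocess; return None if it hangs,
        crashes, or exceeds `timeout` seconds."""
        out_queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_parse_worker,
                                       args=(raw_html, out_queue))
        proc.start()
        try:
            return out_queue.get(timeout=timeout)
        except queue.Empty:          # no result: hung, crashed, or too slow
            return None
        finally:
            if proc.is_alive():      # endless loop: kill the child process
                proc.terminate()
            proc.join()

Spawning a process per document adds overhead, though, and you still have to decide what to do with the documents you give up on, which is what skip mode handles at the framework level.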
The last step for me was to turn on skip mode (instructions here: Setting hadoop parameters using boto?).
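For reference, on the classic (mapred.*) API skip mode is driven by the mapred.skip.* properties, which a streaming job receives as -D options. A rough sketch of passing them through boto's StreamingStep is below; the values and S3 paths are illustrative assumptions, and the linked question covers where boto places step_args relative to the other streaming options, which matters for -D flags:

    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    # Illustrative values only; check the Hadoop docs for the full set of
    # mapred.skip.* properties and their defaults.
    skip_mode_args = [
        '-D', 'mapred.skip.map.max.skip.records=1',       # a value > 0 turns skipping on
        '-D', 'mapred.skip.attempts.to.start.skipping=2',  # skip only after 2 failed attempts
        '-D', 'mapred.map.max.attempts=8',                 # give skipping room to isolate bad records
    ]

    step = StreamingStep(
        name='parse documents with skip mode',
        mapper='s3://my-bucket/mapper.py',   # hypothetical paths
        reducer='NONE',
        input='s3://my-bucket/input/',
        output='s3://my-bucket/output/',
        step_args=skip_mode_args,
    )

    conn = EmrConnection()  # credentials come from the usual boto config
    conn.run_jobflow(name='skip-mode job',
                     log_uri='s3://my-bucket/logs/',
                     steps=[step])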