Hadoop JobClient: Error reading job

I am trying to process 40 gigabytes of Wikipedia articles on my cluster. The problem is the following recurring error message:

 13/04/27 17:11:52 INFO mapred.JobClient: Task Id : attempt_201304271659_0003_m_000046_0, Status : FAILED
 Too many fetch-failures
 13/04/27 17:11:52 WARN mapred.JobClient: Error reading task outputhttp://ubuntu:50060/tasklog?plaintext=true&attemptid=attempt_201304271659_0003_m_000046_0&filter=stdout

When I run the same MapReduce program on a smaller subset of the Wikipedia articles rather than the full set, it works fine and I get all the desired results. Based on this, I thought it might be a memory problem. I cleared all userlogs (as indicated in a similar post) and tried again. No luck. I reduced replication to 1 and added a few more nodes. Still no luck.
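For what it's worth, "clearing the userlogs" here just means deleting the per-attempt log directories on each tasktracker, roughly like this (the path is install-dependent; userlogs live under ${hadoop.log.dir}/userlogs, typically $HADOOP_HOME/logs/userlogs on a default 1.x setup):

 # Run on every tasktracker node; the path below assumes a default layout.
 rm -rf "$HADOOP_HOME"/logs/userlogs/*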

The cluster summary is as follows:

  • Configured Capacity: 205.76 GB
  • DFS Used: 40.39 GB
  • Non DFS Used: 44.66 GB
  • DFS Remaining: 120.7 GB
  • DFS Used%: 19.63%
  • DFS Remaining%: 58.66%
  • Live Nodes: 12
  • Dead Nodes: 0
  • Decommissioning Nodes: 0
  • Number of Under-Replicated Blocks: 0

Each node runs Ubuntu 12.04 LTS.

Any help is appreciated.

EDIT

JobTracker log: http://txtup.co/gtBaY

TaskTracker log: http://txtup.co/wEZ5l

+4
3 answers

Fetch failures are often caused by DNS issues. Check each datanode to make sure the hostname and IP address it is configured with resolve to each other via DNS.

You can do this by visiting each node in your cluster, running hostname and ifconfig, and noting the hostname and IP address returned. Say, for example, this returns the following:

 namenode.foo.com   10.1.1.100
 datanode1.foo.com  10.1.1.1
 datanode2.foo.com  10.1.1.2
 datanode3.foo.com  10.1.1.3

Then, on each node, run nslookup for all the hostnames reported by the other nodes. Make sure the IP address returned matches the one found with ifconfig. For example, on datanode1.foo.com you would run:

 nslookup namenode.foo.com
 nslookup datanode2.foo.com
 nslookup datanode3.foo.com

and they should return:

 10.1.1.100
 10.1.1.2
 10.1.1.3
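If you have more than a handful of nodes, a small shell loop can do this cross-check for you. This is just a rough sketch (the hostname/IP pairs are the example values from above, and the nslookup parsing may need adjusting for your resolver's output format):

 # Expected hostname=IP pairs -- replace with your cluster's values.
 HOSTS="namenode.foo.com=10.1.1.100 datanode1.foo.com=10.1.1.1 datanode2.foo.com=10.1.1.2 datanode3.foo.com=10.1.1.3"

 for pair in $HOSTS; do
   host="${pair%%=*}"
   expected="${pair##*=}"
   # Print the address following the "Name:" line in nslookup's output,
   # skipping the resolver's own address at the top.
   actual=$(nslookup "$host" | awk '/^Name:/ {found=1} found && /^Address/ {print $2; exit}')
   if [ "$actual" = "$expected" ]; then
     echo "OK    $host -> $actual"
   else
     echo "FAIL  $host -> ${actual:-no answer} (expected $expected)"
   fi
 done

Run it on each node so every machine's resolver gets checked.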

When you ran the job on a subset of the data, you probably just didn't have enough splits for any task to be scheduled on the misconfigured datanode(s).

+1

I had a similar problem and was able to find a solution. The problem is how Hadoop deals with small files. In my case, I had about 150 text files that added up to 10 MB. Because of how the files are "divided" into blocks, the system quickly ran out of memory. To solve this, you need to "fill" the blocks and arrange the new files so that they are spread evenly across blocks. Hadoop lets you "archive" small files so that they are correctly distributed into blocks.

 hadoop archive -archiveName files.har -p /user/hadoop/data /user/hadoop/archive

In this case, I created an archive called files.har from the /user/hadoop/data folder and stored it in the /user/hadoop/archive folder. After that, I rebalanced the cluster by running start-balancer.sh.

Now when I run the wordcount example again against files.har, it works fine.
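For reference, the archived files can be read through the har:// scheme. Something along these lines, assuming the paths above (the examples jar name varies by Hadoop release, and the output path here is just a placeholder):

 # Sanity check: list the files inside the archive
 hadoop fs -ls har:///user/hadoop/archive/files.har

 # Run wordcount over the archived files (output path is a placeholder)
 hadoop jar "$HADOOP_HOME"/hadoop-examples-*.jar wordcount \
     har:///user/hadoop/archive/files.har \
     /user/hadoop/wordcount-output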

Hope this helps.

Best

Enrique

0

I had exactly the same problem with Hadoop 1.2.1 on an 8-node cluster. The problem was in the /etc/hosts file. I deleted all entries containing "127.0.0.1 localhost". Instead of "127.0.0.1 localhost", you should map your IP address to your hostname (for example, "10.15.3.35 myhost"). Note that this is necessary on every node in the cluster. So, in a two-node cluster, the master's /etc/hosts should contain "10.15.3.36 masters_hostname", and the slave's /etc/hosts should contain "10.15.3.37 slave1_hostname". After these changes, it is a good idea to restart the cluster. Also take a look at some basic Hadoop troubleshooting tips: Hadoop Troubleshooting
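To make the layout concrete, the entries described above would look roughly like this (the hostnames and addresses are the placeholders from this answer):

 # master's /etc/hosts
 10.15.3.36  masters_hostname

 # slave's /etc/hosts
 10.15.3.37  slave1_hostname

And, as the other answer points out, each node also needs to be able to resolve the other nodes' hostnames, whether via /etc/hosts entries or DNS.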

0
