Hadoop job failed, resource manager does not recognize AttemptID

Question

Hadoop job failed, resource manager does not recognize AttemptID

I am trying to merge some data into an Oozie workflow. However, the aggregation step is not performed.

I found two points of interest in the logs: the first is an error (?) That seems to be repeated repeatedly:

After the container completes, it is killed, but exits with a non-zero exit code of 143.

It ends with:

2015-05-04 15:35:12,013 INFO [IPC Server handler 7 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 0.7231312 2015-05-04 15:35:12,015 INFO [IPC Server handler 19 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 1.0

And then when it is killed by the application wizard:

 2015-05-04 15:35:13,831 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1430730089455_0009_m_000048_0: Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143

The second point of interest is the actual error, which completely disrupts the work, this happens in the reduction phase, not sure if these two are related:

 2015-05-04 15:35:28,767 INFO [IPC Server handler 20 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257 2015-05-04 15:35:29,930 INFO [IPC Server handler 10 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000052_0 is : 0.19511986 2015-05-04 15:35:31,549 INFO [IPC Server handler 1 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000050_0 is : 0.5324404 2015-05-04 15:35:31,771 INFO [IPC Server handler 28 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257 2015-05-04 15:35:31,890 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009 at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:675) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:244) at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282) at java.lang.Thread.run(Thread.java:695) Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache. at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy36.allocate(Unknown Source) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:188) at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:667) ... 3 more Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache. at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:394) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy35.allocate(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) ... 11 more

After that, the oozie: launcher job and the job that got the error just sit there with STATE: accepted, FINALSTATUS: undefined and TRACKING UI: unrecognized.

Does anyone know what causes this error, and how can I fix it? The same workflow worked before, and I could not say that I changed something between them ...

+4

mapreduce hadoop oozie

h2b May 04 '15 at 14:44

source share

2 answers

I also ran into a similar problem. In my case, the task was completed successfully. But there was a lot of time to complete, and there were a very large number of magazines on jobtracker, most of the journal entries were from the progress report, for example the following:

 2015-05-04 15:35:12,013 INFO [IPC Server handler 7 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 0.7231312 2015-05-04 15:35:12,015 INFO [IPC Server handler 19 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 1.0

When debugging my code, I found that the distributed cache is read through the map function of the Mapper class. After moving the logic to read the dist-cache data from the map function to configure the method of the Mapper class, my problem is solved.

We need to use the configure method for operations that are mainly related to configurations or code blocks that we need to process only once before calling the map function.

0

java_dev Oct 7 '16 at 8:05

source share

h2b · Accepted Answer · 2015-05-22T08:34:32+0000

Just in case, someone else is deceiving this error: it seemed that it was caused due to chaos ending in disk space ... A rather mysterious error for something as simple as that. I thought that ~ 90 GB would be enough to work with my 30 GB dataset, I was wrong.

Hadoop job failed, resource manager does not recognize AttemptID

More articles: