YARN AppMaster request for containers not working

I am running a local YARN cluster with 8 vCores and 8 GB of total memory.

The workflow is as follows:

  • YarnClient submits an application request that launches the AppMaster in a container.

  • The AppMaster starts, creates amRMClient and nmClient, registers with the RM, and then creates 4 container requests for workflows through amRMClient.addContainerRequest.

Although there are enough resources, no containers are allocated (the onContainersAllocated callback is never called). I checked the NodeManager and ResourceManager logs and do not see any line related to the container requests. I followed the Apache docs closely and cannot figure out what I am doing wrong.

For reference, here is the AppMaster code:

    @Override
    public void run() {
        Map<String, String> envs = System.getenv();
        String containerIdString = envs.get(ApplicationConstants.Environment.CONTAINER_ID.toString());
        if (containerIdString == null) {
            // container id should always be set in the env by the framework
            throw new IllegalArgumentException("ContainerId not set in the environment");
        }
        ContainerId containerId = ConverterUtils.toContainerId(containerIdString);
        ApplicationAttemptId appAttemptID = containerId.getApplicationAttemptId();

        LOG.info("Starting AppMaster Client...");

        YarnAMRMCallbackHandler amHandler = new YarnAMRMCallbackHandler(allocatedYarnContainers);

        // TODO: get heart-beat interval from config instead of 100 default value
        amClient = AMRMClientAsync.createAMRMClientAsync(1000, this);
        amClient.init(config);
        amClient.start();

        LOG.info("Starting AppMaster Client OK");

        //YarnNMCallbackHandler nmHandler = new YarnNMCallbackHandler();
        containerManager = NMClient.createNMClient();
        containerManager.init(config);
        containerManager.start();

        // Get port, url information. TODO: get tracking url
        String appMasterHostname = NetUtils.getHostname();
        String appMasterTrackingUrl = "/progress";

        // Register self with ResourceManager. This will start heart-beating to the RM
        RegisterApplicationMasterResponse response = null;

        LOG.info("Register AppMaster on: " + appMasterHostname + "...");

        try {
            response = amClient.registerApplicationMaster(appMasterHostname, 0, appMasterTrackingUrl);
        } catch (YarnException | IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            return;
        }

        LOG.info("Register AppMaster OK");

        // Dump out information about cluster capability as seen by the resource manager
        int maxMem = response.getMaximumResourceCapability().getMemory();
        LOG.info("Max mem capabililty of resources in this cluster " + maxMem);

        int maxVCores = response.getMaximumResourceCapability().getVirtualCores();
        LOG.info("Max vcores capabililty of resources in this cluster " + maxVCores);

        containerMemory = Integer.parseInt(config.get(YarnConfig.YARN_CONTAINER_MEMORY_MB));
        containerCores = Integer.parseInt(config.get(YarnConfig.YARN_CONTAINER_CPU_CORES));

        // A resource ask cannot exceed the max.
        if (containerMemory > maxMem) {
            LOG.info("Container memory specified above max threshold of cluster."
                + " Using max value." + ", specified=" + containerMemory + ", max=" + maxMem);
            containerMemory = maxMem;
        }

        if (containerCores > maxVCores) {
            LOG.info("Container virtual cores specified above max threshold of cluster."
                + " Using max value." + ", specified=" + containerCores + ", max=" + maxVCores);
            containerCores = maxVCores;
        }

        List<Container> previousAMRunningContainers = response.getContainersFromPreviousAttempts();
        LOG.info("Received " + previousAMRunningContainers.size()
            + " previous AM running containers on AM registration.");

        for (int i = 0; i < 4; ++i) {
            ContainerRequest containerAsk = setupContainerAskForRM();
            amClient.addContainerRequest(containerAsk); // NOTHING HAPPENS HERE...
            LOG.info("Available resources: " + amClient.getAvailableResources().toString());
        }

        while (completedYarnContainers != 4) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        LOG.info("Done with allocation!");
    }

    @Override
    public void onContainersAllocated(List<Container> containers) {
        LOG.info("Got response from RM for container ask, allocatedCnt=" + containers.size());

        for (Container container : containers) {
            LOG.info("Allocated yarn container with id: {}" + container.getId());
            allocatedYarnContainers.push(container);

            // TODO: Launch the container in a thread
        }
    }

    @Override
    public void onError(Throwable error) {
        LOG.error(error.getMessage());
    }

    @Override
    public float getProgress() {
        return (float) completedYarnContainers / allocatedYarnContainers.size();
    }

Here's the output from jps:

    14594 NameNode
    15269 DataNode
    17975 Jps
    14666 ResourceManager
    14702 NodeManager

And here is the AppMaster log for the initialization and the 4 container requests:

    23:47:09 YarnAppMaster - Starting AppMaster Client OK
    23:47:09 YarnAppMaster - Register AppMaster on: andrei-mbp.local/192.168.1.4...
    23:47:09 YarnAppMaster - Register AppMaster OK
    23:47:09 YarnAppMaster - Max mem capabililty of resources in this cluster 2048
    23:47:09 YarnAppMaster - Max vcores capabililty of resources in this cluster 2
    23:47:09 YarnAppMaster - Received 0 previous AM running containers on AM registration.
    23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
    23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
    23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
    23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
    23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
    23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
    23:47:11 YarnAppMaster - Requested container ask: Capability[<memory:512, vCores:1>]Priority[0]
    23:47:11 YarnAppMaster - Available resources: <memory:7680, vCores:0>
    23:47:11 YarnAppMaster - Progress indicator should not be negative

Thanks in advance.

2 answers

I suspect the problem comes precisely from the negative progress:

  23:47:11 YarnAppMaster - Progress indicator should not be negative 

Note that, since you are using AMRMClientAsync, requests are not sent immediately when you call addContainerRequest. Instead, a heartbeat function runs periodically, and it is in this function that allocate is called and the pending requests are sent. The progress value used by this function initially starts at 0, but it is updated with the value returned by your handler once the response to an allocate is received.
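To make the deferred behaviour concrete, here is a small toy model of such a heartbeat. It is only a sketch and not Hadoop's implementation: `ToyAsyncClient`, `heartbeat`, `pendingCount` and `sentCount` are hypothetical names, and the `progress >= 0` check is an assumption inferred from the "Progress indicator should not be negative" message in the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of a heartbeat-driven client; NOT Hadoop's AMRMClientAsync. */
public class ToyAsyncClient {
    private final List<String> pending = new ArrayList<>();
    private final List<String> sent = new ArrayList<>();
    private final Supplier<Float> progressCallback;
    private float progress = 0f; // starts at 0 before the first response

    public ToyAsyncClient(Supplier<Float> progressCallback) {
        this.progressCallback = progressCallback;
    }

    /** Only queues the request; nothing is sent until the next heartbeat. */
    public void addContainerRequest(String request) {
        pending.add(request);
    }

    /** One heartbeat: validate progress, send queued requests, refresh progress. */
    public void heartbeat() {
        // Assumed validation; written so that NaN is rejected as well.
        if (!(progress >= 0)) {
            throw new IllegalArgumentException("Progress indicator should not be negative");
        }
        sent.addAll(pending);
        pending.clear();
        progress = progressCallback.get();
    }

    public int pendingCount() { return pending.size(); }

    public int sentCount() { return sent.size(); }

    public static void main(String[] args) {
        // The handler returns 0/0, like getProgress() in the question.
        ToyAsyncClient client = new ToyAsyncClient(() -> 0f / 0);
        client.heartbeat();                  // first heartbeat stores NaN as progress
        client.addContainerRequest("ask-1"); // queued, not sent
        try {
            client.heartbeat();              // rejected before anything is sent
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage() + ", still pending: " + client.pendingCount());
        }
    }
}
```

In this model the queued request never leaves the client once the stored progress is invalid, which matches the symptom of seeing nothing about container requests in the ResourceManager log.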

The first allocate is supposed to be performed right after the registration, so getProgress should be called at that point and the stored progress updated. As it stands, your progress is updated to NaN, because at that time allocatedYarnContainers is empty and completedYarnContainers is also 0, so the returned progress is the result of 0/0, which is undefined. When the next allocate checks your progress value, the check fails, because NaN returns false in all comparisons. As a result, no further allocate call actually communicates with the ResourceManager: each one quits at that first step with an exception.

Try changing your progress function to the following:
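The arithmetic is easy to check in isolation. The following standalone sketch (the class name `NaNProgressDemo` is made up for illustration) mirrors the float-by-int division from the question and prints how the result behaves in comparisons:

```java
public class NaNProgressDemo {
    /** Mirrors the question's getProgress(): a float divided by an int. */
    public static float progress(int completed, int allocated) {
        return (float) completed / allocated;
    }

    public static void main(String[] args) {
        float p = progress(0, 0);
        System.out.println(Float.isNaN(p)); // true
        System.out.println(p >= 0);         // false
        System.out.println(p < 0);          // false
        System.out.println(p == p);         // false, NaN is not even equal to itself
    }
}
```

No ArithmeticException is thrown here, because the cast turns the operation into a floating-point division. So the value is not actually negative, but any check written as "progress >= 0" still rejects it.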

    @Override
    public float getProgress() {
        return (float) allocatedYarnContainers.size() / 4.0f;
    }

(note: copied to StackOverflow for posterity from here )
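If you would rather not hard-code the 4, a more general variant is to guard the division and clamp the result. This is only a sketch under my own assumptions: `SafeProgress` and its `progress(completed, total)` helper are hypothetical and not part of the YARN API; you would call something like it from your own getProgress().

```java
public class SafeProgress {
    /** Fraction of completed containers, always a finite value in [0, 1]. */
    public static float progress(int completed, int total) {
        if (total <= 0) {
            return 0f; // nothing requested yet: report no progress instead of 0/0
        }
        float p = (float) completed / total;
        return Math.max(0f, Math.min(1f, p)); // clamp against over- or under-counting
    }

    public static void main(String[] args) {
        System.out.println(progress(0, 0)); // 0.0
        System.out.println(progress(2, 4)); // 0.5
        System.out.println(progress(5, 4)); // 1.0
    }
}
```

The point of the guard is that the very first heartbeat, which happens before any container exists, gets a valid 0 instead of NaN.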


Thanks to Alexandre Fonseca for pointing out that getProgress() returns NaN, due to a division by zero, when it is called before the first allocation, which makes the allocate request to the ResourceManager quit immediately with an exception.

Read more about it here .
