Spark on YARN: Relationship between YARN Containers and Spark Executors

I am new to Spark on YARN and do not understand the relationship between YARN containers and Spark executors. I tried the following configuration based on the results of the yarn-utils.py script, which can be used to find an optimal cluster configuration.

The Hadoop cluster (HDP 2.4) I am working on:

  • 1 Master Node:
    • CPU: 2 processors with 6 cores each = 12 cores
    • RAM: 64 GB
    • SSD: 2 x 512 GB
  • 5 Slave Nodes:
    • CPU: 2 processors with 6 cores each = 12 cores
    • RAM: 64 GB
    • HDD: 4 x 3 TB = 12 TB
  • HBase is installed (one of the options for the script below)

So, I ran python yarn-utils.py -c 12 -m 64 -d 4 -k True (c = cores, m = memory, d = disks, k = hbase-installed) and got the following result:

  Using cores=12 memory=64GB disks=4 hbase=True
  Profile: cores=12 memory=49152MB reserved=16GB usableMem=48GB disks=4
  Num Container=8
  Container Ram=6144MB
  Used Ram=48GB
  Unused Ram=16GB
  yarn.scheduler.minimum-allocation-mb=6144
  yarn.scheduler.maximum-allocation-mb=49152
  yarn.nodemanager.resource.memory-mb=49152
  mapreduce.map.memory.mb=6144
  mapreduce.map.java.opts=-Xmx4915m
  mapreduce.reduce.memory.mb=6144
  mapreduce.reduce.java.opts=-Xmx4915m
  yarn.app.mapreduce.am.resource.mb=6144
  yarn.app.mapreduce.am.command-opts=-Xmx4915m
  mapreduce.task.io.sort.mb=2457
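As a sanity check, the container sizing in that output can be reproduced with a little arithmetic. This is only a sketch: the 16 GB reservation (OS + HBase on a 64 GB node) and the ~80% heap factor are the usual HDP sizing heuristics, not values read from the cluster:

```python
# Rough re-derivation of the yarn-utils.py numbers above.
# Assumptions (HDP sizing heuristics): 16 GB reserved for the OS and
# HBase on a 64 GB node, 8 containers per node, and -Xmx set to
# roughly 80% of the container RAM to leave room for non-heap memory.
total_ram_gb = 64
reserved_gb = 16                       # OS + HBase reservation
usable_mb = (total_ram_gb - reserved_gb) * 1024
num_containers = 8
container_ram_mb = usable_mb // num_containers
xmx_mb = int(container_ram_mb * 0.8)   # heap headroom heuristic

print(usable_mb, container_ram_mb, xmx_mb)  # → 49152 6144 4915
```

The results match the script's usableMem=48GB, Container Ram=6144MB and -Xmx4915m values.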

I made these settings through the Ambari interface and restarted the cluster. The values also roughly match what I had calculated by hand beforehand.

My problems are:

  • finding the optimal settings for my spark-submit script
    • the options --num-executors , --executor-cores and --executor-memory
  • understanding the relationship between YARN containers and Spark executors
  • understanding the hardware information in my Spark History UI (it shows less memory than I configured, when I calculate the total memory by multiplying by the number of worker nodes)
  • understanding the concept of vcores in YARN; I have not yet found useful examples for it

However, I found this post, What is a container in YARN?, but it did not help, since it does not describe the relationship to the executors.

Can someone help resolve one or more of these issues?

1 answer

I will report my findings here, step by step:

  • The first important thing is this fact (source: this Cloudera documentation):

    When Spark runs on YARN, each Spark executor runs as a YARN container. [...]

  • This means the number of containers will always be the same as the number of executors created by the Spark application, e.g. via the --num-executors parameter of spark-submit.

  • As set by yarn.scheduler.minimum-allocation-mb , each container is always allocated at least this amount of memory. This means that if the --executor-memory parameter is set to, e.g., only 1g , but yarn.scheduler.minimum-allocation-mb is, e.g., 6g , the container ends up much larger than the Spark application requires.

  • Conversely, if the --executor-memory parameter is set to a value that exceeds yarn.scheduler.minimum-allocation-mb , e.g. 12g , the container will allocate more memory, but only if the requested amount of memory is less than or equal to yarn.scheduler.maximum-allocation-mb .
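The two cases above can be sketched numerically. Two assumptions go into this sketch, both defaults that can be configured differently: Spark (1.x, as shipped with HDP 2.4) requests a per-executor memory overhead of max(384 MB, 10% of executor memory) on top of --executor-memory, and the YARN CapacityScheduler rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb:

```python
import math

def yarn_container_mb(executor_memory_mb, min_alloc_mb, max_alloc_mb):
    """Approximate the container size YARN grants for a Spark executor.

    Assumes the default Spark overhead formula max(384 MB, 10%) and a
    scheduler that rounds requests up to multiples of the minimum
    allocation, capped at the maximum allocation.
    """
    overhead_mb = max(384, executor_memory_mb // 10)
    requested_mb = executor_memory_mb + overhead_mb
    granted_mb = math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb
    return min(granted_mb, max_alloc_mb)

# --executor-memory 1g with min-alloc 6144 MB: the container is still 6144 MB
print(yarn_container_mb(1024, 6144, 49152))   # → 6144
# --executor-memory 12g: 12288 + 1228 MB overhead, rounded up to 18432 MB
print(yarn_container_mb(12288, 6144, 49152))  # → 18432
```

The first case is the waste described above: a 1 GB executor still occupies a full 6 GB container.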

  • The yarn.nodemanager.resource.memory-mb value determines how much memory the sum of all containers on a single host may allocate!

=> Thus, lowering yarn.scheduler.minimum-allocation-mb allows you to run smaller containers, e.g. for smaller executors ( otherwise memory would be wasted ).

=> Setting yarn.scheduler.maximum-allocation-mb to the maximum value (e.g. equal to yarn.nodemanager.resource.memory-mb ) allows you to define larger executors (more memory is allocated if needed, e.g. via the --executor-memory parameter).
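Putting the limits together, you can estimate how many executors of a given size fit on this cluster (a sketch using the numbers from the question; note that each application also occupies one extra container for the YARN ApplicationMaster, which is not an executor):

```python
# How many 6144 MB executor containers fit, given
# yarn.nodemanager.resource.memory-mb = 49152 on each of the 5 slave nodes.
node_memory_mb = 49152
container_mb = 6144
slave_nodes = 5

containers_per_node = node_memory_mb // container_mb
total_containers = containers_per_node * slave_nodes
max_executors = total_containers - 1   # one container goes to the ApplicationMaster

print(containers_per_node, total_containers, max_executors)  # → 8 40 39
```

Regarding the vcores question: in default HDP setups the CapacityScheduler uses the DefaultResourceCalculator, which considers only memory when placing containers, so vcores frequently do not constrain the allocation at all unless you switch to the DominantResourceCalculator.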

