How to find the right ratio between Hadoop instance types

I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I could not find a single tutorial that explains how to work this out.

  • How do I know whether I need more than 1 core instance? What "symptoms" in the EMR console metrics would tell me that I need more than one core? So far, when I ran the same job with 1 * core + 7 * task instances, it performed much the same as with 8 * core, which does not make much sense to me. Or is it possible that my job is so CPU-bound that IO is insignificant? (I have a map-only job that parses Apache log files into a CSV file.)

  • Is there such a thing as having more than one master instance? If so, when is it needed? I wonder because my master node pretty much just waits for the other nodes to do the job (0% CPU) for 95% of the time.

  • Can the master and core node be identical? I can have a single-node cluster, where the one and only node does everything. It seems logical to have a cluster with 1 node that is both master and core, with the rest being task nodes, but it seems impossible to set it up that way with EMR. Why is that?

1 answer

The master instance acts as a manager and coordinates everything that happens in the whole cluster. It therefore has to exist in every job flow you run, but just one instance is all you need. Unless you deploy a single-node cluster (in which case the master instance is the only node), it does not do any of the heavy lifting of the actual MapReduce work, so the instance does not have to be a powerful machine.

The number of core instances you need depends on the job and on how fast you want to process it, so there is no single correct answer. The good news is that you can resize the core/task instance groups, so if you think your job is running slowly, you can add more instances to a running job flow.

One important difference between the core and task instance groups is that core instances store actual data on HDFS, whereas task instances do not. As a consequence, you can only increase the size of the core instance group (because removing running instances would lose the data stored on them). The task instance group, on the other hand, can be both increased and decreased by adding or removing task instances.
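The resize rule described above can be written down as a small check. This is only an illustrative sketch in Python: `InstanceGroup` and `resize` are hypothetical names invented for this example, not part of any AWS SDK, and the code encodes the rule exactly as stated in this answer rather than querying EMR itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroup:
    """Hypothetical stand-in for an EMR instance group."""
    role: str   # "MASTER", "CORE" or "TASK"
    count: int  # number of running instances


def resize(group: InstanceGroup, new_count: int) -> InstanceGroup:
    """Return a resized copy of the group, or raise ValueError if the
    rule described in this answer forbids the change."""
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    if group.role == "MASTER" and new_count != group.count:
        # One master instance is all a cluster needs.
        raise ValueError("the master group is not resized")
    if group.role == "CORE" and new_count < group.count:
        # Core instances hold HDFS data, so removing them would lose data.
        raise ValueError("the core group can only grow")
    # Task instances hold no HDFS data, so they can grow or shrink freely.
    return InstanceGroup(group.role, new_count)
```

For example, `resize(InstanceGroup("TASK", 7), 3)` succeeds, while `resize(InstanceGroup("CORE", 4), 1)` raises `ValueError`.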

So these two types of instances can be used to adjust the processing power of your job. Typically, you use on-demand instances for core instances because they must be running all the time and cannot be lost, and you use spot instances for task instances because losing task instances does not kill the entire job (for example, tasks not finished by task instances are rerun on core instances). This is one way to run a large cluster cheaply by using spot instances.
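As a rough sketch of that layout, the helper below builds an instance group list with on-demand master and core groups and a spot task group. The dictionary keys follow the shape that boto3's EMR `run_job_flow` accepts under `Instances["InstanceGroups"]`; the function name `build_instance_groups`, the group names, the default instance type, and the bid price are assumptions made for illustration. The code only builds the configuration and does not call AWS.

```python
def build_instance_groups(core_count, task_count,
                          instance_type="m5.xlarge",  # placeholder type
                          task_bid_price=None):
    """Build an instance group list: on-demand master and core groups,
    plus an optional spot task group."""
    if core_count < 1:
        raise ValueError("a multi-node cluster needs at least one core instance")
    groups = [
        {   # A single master is enough; it only coordinates the cluster.
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {   # Core instances store HDFS data, so keep them on-demand.
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        task_group = {  # Task instances hold no HDFS data, so spot is acceptable.
            "Name": "task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        }
        if task_bid_price is not None:
            task_group["BidPrice"] = task_bid_price  # a string, e.g. "0.10"
        groups.append(task_group)
    return groups
```

The result of `build_instance_groups(core_count=1, task_count=7)` matches the 1 * core + 7 * task setup from the question and could be passed as the `InstanceGroups` entry of the `Instances` argument when creating a cluster.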

A general description of each type of instance is available here:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/InstanceGroups.html

In addition, this video may be useful for efficient use of EMR:

https://www.youtube.com/watch?v=a5D_bs7E3uc
