How do WordCount MapReduce jobs run on Hadoop clusters with Apache Tez?

As the Tez GitHub page shows, Tez is very simple and consists of only two components:

  • An engine for executing data-processing tasks, and

  • A master for the data-processing application, which can combine the data-processing tasks described above into a task-DAG

Well, my first question is: how are existing MapReduce jobs like the WordCount in tez-examples.jar converted into a task-DAG? Where does that happen? Or are they not converted at all?

And my second, more important question concerns this part:

Each "task" in the thesis has the following:

  • An Input to consume key/value pairs from.
  • A Processor for processing them.
  • An Output for collecting the processed key/value pairs.
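
For reference, my understanding of what such a task looks like in code is roughly the sketch below. It is loosely modeled on the TokenProcessor from the WordCount example in tez-examples; the logical input/output names ("Input", "Summation") are only illustrative and depend on how the DAG is wired up.

    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.tez.runtime.api.ProcessorContext;
    import org.apache.tez.runtime.library.api.KeyValueReader;
    import org.apache.tez.runtime.library.api.KeyValueWriter;
    import org.apache.tez.runtime.library.processor.SimpleProcessor;

    public class TokenProcessor extends SimpleProcessor {

      public TokenProcessor(ProcessorContext context) {
        super(context);
      }

      @Override
      public void run() throws Exception {
        // Input: the source of key/value pairs (logical name given when building the DAG)
        KeyValueReader reader = (KeyValueReader) getInputs().get("Input").getReader();
        // Output: collects the processed key/value pairs for the next stage
        KeyValueWriter writer = (KeyValueWriter) getOutputs().get("Summation").getWriter();

        Text word = new Text();
        IntWritable one = new IntWritable(1);

        // Processor: the user logic that turns input pairs into output pairs
        while (reader.next()) {
          StringTokenizer itr = new StringTokenizer(reader.getCurrentValue().toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            writer.write(word, one);
          }
        }
      }
    }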

Who is responsible for splitting the input between Tez tasks? Is it the code that the user provides, or is it YARN (the resource manager), or Tez itself?

The same question applies to the output phase. Thanks in advance.

mapreduce hadoop yarn apache-tez
1 answer

To answer your first question about converting MapReduce jobs into Tez DAGs:

Any MapReduce job can be viewed as a single DAG with two vertices (stages). The first vertex is the Map stage, and it is connected to the downstream Reduce vertex by a Shuffle edge.

There are two ways to run MR jobs on Tez:

  • One is to write your own 2-stage DAG using the Tez APIs directly. This is what the current examples do; a rough sketch follows this list.
  • The second is to use the MapReduce APIs together with the yarn-tez mode. In this case there is a layer that intercepts the MR job submission and, instead of running it as MR, translates the job into a 2-stage Tez DAG and executes that DAG on the Tez runtime.
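
To make the first option concrete, here is a rough sketch of how the 2-stage WordCount DAG is put together with the Tez DAG API. It is loosely based on the WordCount example shipped with Tez; TokenProcessor and SumProcessor stand for the user-written Processor classes (SumProcessor is not shown here), wiring of the data source/sink is omitted, and exact builder/method names can differ slightly between Tez versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.Edge;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;
    import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
    import org.apache.tez.runtime.library.partitioner.HashPartitioner;

    public class WordCountDagSketch {

      public static DAG buildDag(Configuration conf) {
        // Vertex 1: the "Map" stage - tokenizes lines into (word, 1) pairs
        Vertex tokenizer = Vertex.create("Tokenizer",
            ProcessorDescriptor.create(TokenProcessor.class.getName()));

        // Vertex 2: the "Reduce" stage - sums the counts for each word
        Vertex summation = Vertex.create("Summation",
            ProcessorDescriptor.create(SumProcessor.class.getName()), 1 /* parallelism */);

        // The Shuffle edge: a sorted, partitioned key/value connection between the stages
        OrderedPartitionedKVEdgeConfig shuffleEdge = OrderedPartitionedKVEdgeConfig
            .newBuilder(Text.class.getName(), IntWritable.class.getName(),
                HashPartitioner.class.getName())
            .build();

        return DAG.create("WordCount")
            .addVertex(tokenizer)
            .addVertex(summation)
            .addEdge(Edge.create(tokenizer, summation, shuffleEdge.createDefaultEdgeProperty()));
      }
    }

For the second option, as far as I know no code changes are needed: setting mapreduce.framework.name to yarn-tez in mapred-site.xml (or on the job's Configuration) makes an existing MR job such as the stock WordCount get translated into a 2-stage Tez DAG like the one above at submission time.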

For your questions related to data handling:

The user provides the logic for understanding how the data is read and how it is split. Tez then takes each split and is responsible for assigning a split, or a set of splits, to a given task.

The Tez framework then controls the generation and movement of data, i.e. where intermediate data is produced and how it is moved between two vertices/stages. However, it does not control the underlying content structure, partitioning, or serialization logic, which are provided by user plugins.
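
As a rough illustration of that division of responsibility (loosely based on how the Tez examples wire up MRInput; exact method names may vary by version): the InputFormat below is the user-supplied plugin that understands and splits the data, while Tez decides which task gets which split(s) and moves the intermediate data along the DAG's edges.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.tez.dag.api.DataSourceDescriptor;
    import org.apache.tez.dag.api.Vertex;
    import org.apache.tez.mapreduce.input.MRInput;

    public class InputWiringSketch {

      static void addInput(Vertex tokenizer, Configuration conf, String inputPath) {
        // The InputFormat holds the user-provided logic for reading and splitting the
        // data (getSplits / createRecordReader); a custom InputFormat can be plugged
        // in here instead of TextInputFormat.
        DataSourceDescriptor dataSource = MRInput
            .createConfigBuilder(conf, TextInputFormat.class, inputPath)
            .groupSplits(true) // let Tez group small splits before handing them to tasks
            .build();

        // Tez takes the splits produced above and decides which task runs which split(s).
        tokenizer.addDataSource("Input", dataSource);
      }
    }

Similarly, the key/value classes and the partitioner passed to the edge configuration (see the DAG sketch above) are the user-supplied partitioning and serialization plugins; the shuffle movement itself is handled by Tez's runtime inputs and outputs.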

The above is only a high-level picture; there is more complexity underneath. You will get more detailed answers by posting specific questions on the dev list (http://tez.apache.org/mail-lists.html).

