To answer your first question about converting MapReduce jobs to a Tez DAG:
Any MapReduce job can be viewed as a single DAG with two vertices (stages). The first vertex is the Map stage, and it is connected to the second vertex, the Reduce stage, by a Shuffle edge.
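As a conceptual sketch of that two-vertex view (this is an illustration in Python, not the actual Tez Java API; the class names are hypothetical, though "SCATTER_GATHER" is the data-movement type Tez uses for shuffle edges):

```python
# Conceptual model of a MapReduce job as a two-vertex DAG.
# Illustrative only -- not the Tez API.

class Vertex:
    def __init__(self, name):
        self.name = name

class Edge:
    def __init__(self, src, dst, movement):
        self.src, self.dst, self.movement = src, dst, movement

class DAG:
    def __init__(self):
        self.vertices, self.edges = [], []
    def add_vertex(self, v):
        self.vertices.append(v)
        return self
    def add_edge(self, e):
        self.edges.append(e)
        return self

map_v = Vertex("Map")
reduce_v = Vertex("Reduce")
dag = DAG().add_vertex(map_v).add_vertex(reduce_v)
# The shuffle edge connects the Map vertex to the Reduce vertex.
dag.add_edge(Edge(map_v, reduce_v, movement="SCATTER_GATHER"))
```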
There are two ways to run MR jobs on Tez:
- One is to write your own two-stage DAG directly against the Tez API. This is what the current examples demonstrate.
- The other is to keep using the MapReduce APIs and run in yarn-tez mode. In this mode, an interception layer takes the MR job's representation and, instead of executing it as MR, translates it into a two-stage Tez DAG and runs that DAG on the Tez runtime.
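For the second approach, yarn-tez mode is typically enabled by setting the MapReduce framework name in mapred-site.xml (assuming the Tez libraries are installed on the cluster):

```xml
<!-- mapred-site.xml: route MapReduce jobs through the Tez translation layer -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
```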
Regarding your questions about how data is processed:
The user supplies the logic that understands how to read the input data and break it into splits. Tez then takes each split and assumes responsibility for assigning a split, or a set of splits, to a given task.
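As a rough illustration of that division of responsibility (the function names are hypothetical, and the round-robin grouping here is a placeholder for Tez's real, more sophisticated split-grouping logic):

```python
# Hypothetical sketch: user code produces splits; the framework
# assigns them to tasks. Not the actual Tez implementation.

def user_compute_splits(total_bytes, split_size):
    """User-provided logic: break the input into byte-range splits."""
    return [(start, min(start + split_size, total_bytes))
            for start in range(0, total_bytes, split_size)]

def framework_assign(splits, num_tasks):
    """Framework-side logic: distribute splits across tasks, so a
    task may receive one split or a set of splits."""
    tasks = [[] for _ in range(num_tasks)]
    for i, split in enumerate(splits):
        tasks[i % num_tasks].append(split)
    return tasks

splits = user_compute_splits(total_bytes=1000, split_size=256)
assignment = framework_assign(splits, num_tasks=2)
```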
The Tez framework then controls the generation and movement of data: where intermediate data can be produced between stages, and how it is moved between two vertices/stages. It does not, however, control the actual content, partitioning, or serialization logic, which come from user-provided plugins.
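To make that boundary concrete, here is a sketch with hypothetical names: the framework physically groups and moves the bytes on a shuffle edge, while a user-supplied partitioner (the classic hash partitioner is shown) decides which downstream task each record belongs to:

```python
# The framework moves data between the Map and Reduce vertices; the
# routing decision comes from a user-pluggable partitioner. The
# plumbing below is illustrative, not Tez internals.

def hash_partitioner(key, num_reducers):
    """User plugin: choose the target partition for a key."""
    return hash(key) % num_reducers

def shuffle(map_output, num_reducers, partitioner=hash_partitioner):
    """Framework side: group map output by the plugin's choice and
    deliver each partition to its reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partitioner(key, num_reducers)].append((key, value))
    return partitions

parts = shuffle([("a", 1), ("b", 2), ("a", 3)], num_reducers=2)
```

Note that swapping in a different partitioner changes the routing without touching the framework's movement logic, which is the plugin boundary the answer describes.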
The above is a simplified high-level view. You will get more detailed answers by posting specific questions to the development list ( http://tez.apache.org/mail-lists.html ).
hadoop_user