An RDD may depend on zero or more other RDDs. For example, if you write x = y.map(...), x will depend on y. These dependency relationships can be viewed as a graph.
You can call this graph a lineage graph, since it represents how each RDD is derived. It is also necessarily a DAG, since it cannot contain a cycle: an RDD can only depend on RDDs that already exist.
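To make the lineage idea concrete, here is a minimal sketch in plain Python. It is not Spark code: ToyRDD and lineage are hypothetical names invented for illustration, and the model only records which node was derived from which.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: a name plus the nodes it was derived from."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)

    def map(self, name):
        # A transformation returns a new node that depends on this one.
        return ToyRDD(name, parents=[self])

    def union(self, other, name):
        # A node may depend on more than one parent.
        return ToyRDD(name, parents=[self, other])


def lineage(rdd):
    """Return the names of rdd and its ancestors, parents before children."""
    seen, order = set(), []

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node.name)

    visit(rdd)
    return order


y = ToyRDD("y")
x = y.map("x")          # x depends on y
z = x.union(y, "z")     # z depends on both x and y
print(lineage(z))       # ['y', 'x', 'z']
```

Because a node's parents must be constructed before the node itself, there is no way to build a cycle in this model, which mirrors why the lineage graph is a DAG.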
Narrow dependencies, where no shuffle is required (think map and filter), can be collapsed into a single stage. Stages are a unit of execution, and they are generated by the DAGScheduler from the RDD dependency graph. Stages also depend on each other. The DAGScheduler builds and uses this stage dependency graph (which is also necessarily a DAG) to schedule the stages.
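The way narrow dependencies get grouped into stages can be sketched with a toy model. This is plain Python, not the real DAGScheduler: Node, narrow, shuffled and build_stages are hypothetical names, and the sketch assumes a simple lineage in which no node is shared between two stages.

```python
class Node:
    """Hypothetical RDD stand-in; deps holds (parent, is_shuffle) pairs."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)


def narrow(name, parent):
    # Narrow dependency: no shuffle, so it can share a stage with its parent.
    return Node(name, [(parent, False)])


def shuffled(name, parent):
    # Shuffle dependency: marks a stage boundary.
    return Node(name, [(parent, True)])


def build_stages(final):
    """Return stages as lists of node names, parent stages first."""
    stages, visited = [], set()

    def stage_for(root):
        if id(root) in visited:
            return
        members, parent_roots, stack = [], [], [root]
        while stack:
            node = stack.pop()
            if id(node) in visited:
                continue
            visited.add(id(node))
            members.append(node.name)
            for parent, is_shuffle in node.deps:
                # Follow narrow deps inside this stage; cut at shuffle deps.
                (parent_roots if is_shuffle else stack).append(parent)
        for parent in parent_roots:
            stage_for(parent)          # parent stages are emitted first
        stages.append(members[::-1])

    stage_for(final)
    return stages


lines = Node("textFile")
words = narrow("flatMap", lines)
pairs = narrow("map", words)
counts = shuffled("reduceByKey", pairs)
result = narrow("filter", counts)
print(build_stages(result))
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'filter']]
```

In this toy run the three narrow steps before the shuffle land in one stage, and the shuffle introduces a second stage that depends on the first.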