Do I understand correctly?
Not really. Since the stages are provided in topological order, all you have to do to traverse the graph in the correct order is apply the PipelineStages from left to right. And this is exactly what happens when you call PipelineModel.transform.
The sequence of stages is traversed twice:

- once to validate the schema, using each stage's transformSchema method, before any data is touched;
- once to perform the actual transformations.
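A minimal sketch of those two traversals (the helper runStages is hypothetical, for illustration only; it roughly mirrors what a fitted PipelineModel does internally):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical helper illustrating the two left-to-right traversals.
def runStages(stages: Seq[Transformer], dataset: Dataset[_]): DataFrame = {
  // First pass: schema-only validation; no data is touched, a bad
  // schema fails fast here.
  stages.foldLeft(dataset.schema)((schema, stage) => stage.transformSchema(schema))
  // Second pass: the actual transformations, applied left to right.
  stages.foldLeft(dataset.toDF)((df, stage) => stage.transform(df))
}
```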
Similarly, what happens if, for one of the stages, I don't specify inputCol(s)?
Pretty much nothing interesting. Since the stages are applied sequentially, and the only schema validation is performed by the given Transformer itself, using its transformSchema method, before the actual transformation starts, it will be processed like any other stage.
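For instance, here is a minimal custom Transformer (hypothetical, not from the question) that uses no inputCol Param at all and hard-codes its column. The pipeline treats it like any other stage; the only thing standing between it and a runtime failure is its own transformSchema check:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical stage: doubles a hard-coded numeric column "x"
// instead of reading an inputCol Param.
class Doubler(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("doubler"))

  override def transformSchema(schema: StructType): StructType = {
    // The only validation the pipeline ever sees: the stage checks its
    // own requirements against the schema produced by upstream stages.
    require(schema.fieldNames.contains("x"), "Column 'x' is required")
    StructType(schema.fields :+ StructField("x_doubled", DoubleType, nullable = false))
  }

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset.toDF.withColumn("x_doubled", col("x") * 2.0)
  }

  override def copy(extra: ParamMap): Doubler = defaultCopy(extra)
}
```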
What happens if a Transformer adds two columns to the DataFrame?
Same as above. As long as it produces a valid input schema for the downstream stages, it is no different from any other Transformer.
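As an illustration, using the stock SQLTransformer and VectorAssembler (these are real Spark ML classes, but the column names are made up): a stage that emits two new columns composes fine with a downstream stage that consumes both, because the intermediate schema is all that matters.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}

// One stage adds two columns at once...
val addTwo = new SQLTransformer()
  .setStatement("SELECT *, x * 2 AS x2, x * 3 AS x3 FROM __THIS__")

// ...and the downstream stage consumes both of them.
val assembler = new VectorAssembler()
  .setInputCols(Array("x2", "x3"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = Array(addTwo, assembler)
val pipeline = new Pipeline().setStages(stages)
```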
Transformers don't look at each other's output. I was hoping I could run them in parallel.
Theoretically, you could try to build a custom composite Transformer that encapsulates multiple transformations, but the only part that can be performed independently and actually benefit from this kind of setup is model fitting. At the end of the day you have to return a single transformed DataFrame that can be consumed by the downstream stages, and the actual transformations are most likely scheduled as a single scan over the data anyway.
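A sketch of the one part that can pay off, fitting independent models concurrently (fitInParallel is a hypothetical helper; it assumes a cached DataFrame df with a "features" column, and the scaler choice is purely illustrative). Spark's scheduler accepts jobs from separate threads, so the two fits may run concurrently, but each still yields its own model; this does not merge the transforms into one pass over the data:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.ml.feature.{MinMaxScaler, StandardScaler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: submit two independent fitting jobs
// from separate threads.
def fitInParallel(df: DataFrame) = {
  val standard = Future {
    new StandardScaler().setInputCol("features").setOutputCol("scaled").fit(df)
  }
  val minMax = Future {
    new MinMaxScaler().setInputCol("features").setOutputCol("minmaxed").fit(df)
  }
  (Await.result(standard, Duration.Inf), Await.result(minMax, Duration.Inf))
}
```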
The question remains whether it is really worth the effort. While it is possible to run multiple jobs simultaneously, it provides an edge only if the amount of available resources is relatively high compared to the amount of work required to handle a single job. It usually requires some low-level management (number of partitions, number of shuffle partitions), which is not the strongest suit of Spark SQL.
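The kind of low-level knobs meant here, sketched with illustrative values (assumes a SparkSession spark and a DataFrame df in scope; the right numbers depend entirely on cluster size and data volume):

```scala
// Shuffle parallelism is a session-wide SQL setting, not per-job,
// which is part of why balancing concurrent jobs by hand gets awkward.
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Input parallelism can be forced per DataFrame.
val repartitioned = df.repartition(32)
```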
zero323