Nonlinear (DAG) ML pipelines in Apache Spark

I built a simple Spark ML application with a pipeline of independent transformers that add columns to the DataFrame of the source data. Since the transformers do not look at each other's output, I was hoping I could run them in parallel in a non-linear (DAG) pipeline. All I could find about this feature was a paragraph from the Spark ML Guide:

It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.

My understanding of the paragraph is that if I set the inputCol(s) and outputCol parameters for each transformer and specify the stages in topological order when I create the pipeline, then the engine will use this information to build an execution DAG, so that a stage can start as soon as its inputs are ready.
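
For illustration, the kind of pipeline I mean looks roughly like this (a minimal sketch; the column names "score" and "age" and the particular feature transformers are just placeholders for my independent column-adding stages):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Binarizer, Bucketizer}

// Two transformers that are independent of each other: each reads its own
// input column from the source data and adds a new output column, so the
// data-flow graph has two branches that never touch.
val binarizer = new Binarizer()
  .setInputCol("score")
  .setOutputCol("score_flag")
  .setThreshold(0.5)

val bucketizer = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("age_bucket")
  .setSplits(Array(0.0, 18.0, 65.0, Double.PositiveInfinity))

// Stages are listed in topological order; the question is whether Spark ML
// uses the column dependencies to schedule them as a DAG and run them in
// parallel, or simply applies them one after the other.
val pipeline = new Pipeline().setStages(Array(binarizer, bucketizer))
```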

Some questions about this:

  • Do I understand correctly?
  • What happens if I don't specify an output column for one of the stages/transformers (for example, the stage only filters some rows)? Will it be assumed, for the purpose of building the DAG, that the stage changes all columns, so that all subsequent stages must wait for it?
  • Similarly, what happens if I do not specify inputCol(s) for one of the stages? Will that stage wait for the completion of all previous stages?
  • It seems I can specify multiple input columns but only one output column. What happens if a transformer adds two columns to the DataFrame (Spark itself has no problem with this)? Is there any way to communicate this to the DAG-building engine?
apache-spark apache-spark-mllib apache-spark-ml
1 answer

Do I understand correctly?

Not really. Since the stages are provided in topological order, all you have to do to traverse the graph in the correct order is apply the PipelineStages from left to right, and this is exactly what happens when you call Pipeline.fit (and later PipelineModel.transform).

The sequence of stages is traversed twice:

  • once to validate the schema, using each stage's transformSchema method
  • once to actually fit/transform the data
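
As a rough sketch of the second pass (a paraphrase of the behaviour, not the literal Spark source), a fitted PipelineModel essentially folds over its stages from left to right:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// After the schema of the whole chain has been validated, the stages are
// applied strictly one after another: each transformer receives the output
// of the previous one. There is no DAG scheduling and no parallelism here.
def applyStages(stages: Seq[Transformer], input: DataFrame): DataFrame =
  stages.foldLeft(input)((df, stage) => stage.transform(df))
```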

Similarly, what happens if I do not specify inputCol(s) for one of the stages?

Pretty much nothing interesting. Since the stages are applied sequentially, and the only schema check is the one a given Transformer performs itself via its transformSchema method before the actual transformation starts, it will be processed like any other stage.
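
For example, a stage that only filters rows and declares no input/output column parameters could look like this (a hypothetical custom transformer, with a made-up filter condition):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// A stage with no inputCol/outputCol params: it just filters rows.
// Its transformSchema leaves the schema untouched, so from the pipeline's
// point of view it is simply the next stage in the sequence.
class RowFilter(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("rowFilter"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF.filter("score > 0")   // hypothetical filtering condition

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```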

What happens if the transformer adds two columns to the DataFrame?

Same as above. As long as it produces the correct input schema for the next stages, it is no different from any other Transformer.
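
For instance, a transformer that adds two columns only has to declare both of them in its transformSchema so that downstream stages see them during validation (again a hypothetical sketch; the "text" and "score" columns are made up):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, length}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

// Adds two columns in a single stage. The pipeline never inspects outputs
// beyond what transformSchema reports, and downstream stages only care
// about the resulting schema.
class TwoColumnAdder(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("twoColAdder"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF
      .withColumn("text_len", length(col("text")))          // first new column
      .withColumn("score_sq", col("score") * col("score"))  // second new column

  override def transformSchema(schema: StructType): StructType =
    schema
      .add("text_len", IntegerType)
      .add("score_sq", DoubleType)

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```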

Transformers don't look at each other's output. I was hoping I could run them in parallel.

Theoretically, you could try to build a custom composite transformer that encapsulates several different transformations, but the only part that can be done independently and benefit from this kind of approach is model fitting. In the end you have to return a single transformed DataFrame to be consumed by the downstream stages, and the actual transformations will most likely be scheduled as a single scan of the data anyway.
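
A sketch of what such a composite transformer could look like, just bundling several independent column expressions into one stage (the class and its parameters are hypothetical, not an existing Spark API):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{Column, DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Bundles several independent column-adding expressions into one stage.
// The columns are added one after another in the code below, but since they
// are all narrow, per-row expressions, Catalyst will typically collapse them
// into a single projection over one scan of the data.
class CompositeAdder(override val uid: String,
                     newColumns: Map[String, Column]) extends Transformer {
  def this(newColumns: Map[String, Column]) =
    this(Identifiable.randomUID("compositeAdder"), newColumns)

  override def transform(dataset: Dataset[_]): DataFrame =
    newColumns.foldLeft(dataset.toDF) {
      case (df, (name, expr)) => df.withColumn(name, expr)
    }

  // Without a concrete data type for each expression we can only pass the
  // input schema through; a real implementation would add the new fields so
  // that downstream stages see them during schema validation.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer =
    new CompositeAdder(uid, newColumns)
}
```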

The question remains whether it is really worth the effort. While multiple jobs can be executed at the same time, this only provides an advantage if the amount of available resources is relatively high compared to the amount of work required to process a single job. It usually also requires some low-level management (number of partitions, number of shuffle partitions), which is not the strongest suit of Spark SQL.
