In Spark ML Pipelines, the transformer method transform()takes Spark DataFrameand returns a DataFrame. My regular method _transform()uses a DataFrame, which went through to create an RDD before processing it. This means that the results of my algorithm must be converted back to a DataFrame before returning from _transform().
So how do I create a DataFrame from an RDD inside _transform()?
Normally I would use SparkSession.createDataFrame(). But this means passing the instance SparkSession sparkto my user Transformersomehow (or object SqlContext). And this, in turn, can create other problems , for example, when trying to use a transformer as a stage in an ML pipeline.
snark source
share