How to create a Spark DataFrame inside a custom PySpark ML Pipeline _transform() method?

In Spark ML Pipelines, a Transformer's transform() method takes a Spark DataFrame and returns a DataFrame. My custom _transform() method takes the incoming DataFrame and converts it to an RDD before processing it. That means the result of my algorithm has to be converted back to a DataFrame before _transform() returns.

So how do I create a DataFrame from an RDD inside _transform()?

Normally I would use SparkSession.createDataFrame(). But that means somehow passing the SparkSession instance (or a SQLContext) into my custom Transformer. And that, in turn, can cause other problems, for example when the transformer is used as a stage in an ML Pipeline.

1 answer

It turns out to be as simple as calling, inside _transform():

yourRdd.toDF(yourSchema)

The schema is optional. Unfortunately I cannot link to documentation for toDF(); for some reason it does not appear under https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD . The reason is that toDF() is not defined on the RDD class itself: it is patched onto RDD when a SparkSession is created, which is why it is missing from the RDD API docs.

This way there is no need to pass a SparkSession (or SQLContext) into the Transformer just to call createDataFrame().

