How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT

I am using a Spark 2.0 cluster, and I would like to convert a vector column from org.apache.spark.mllib.linalg.VectorUDT to org.apache.spark.ml.linalg.VectorUDT.

    # Import LinearRegression class
    from pyspark.ml.regression import LinearRegression

    # Define LinearRegression algorithm
    lr = LinearRegression()
    modelA = lr.fit(data, {lr.regParam: 0.0})

Error:

u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'

Any thoughts on how to do this conversion between vector types?

Many thanks.

machine-learning apache-spark pyspark apache-spark-mllib apache-spark-ml
1 answer

In PySpark, you will need a UDF or a map over the RDD. Let's use the first option. First, the imports:

    from pyspark.ml.linalg import VectorUDT
    from pyspark.sql.functions import udf

and the function:

    as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())

With sample data:

    from pyspark.mllib.linalg import Vectors as MLLibVectors

    df = sc.parallelize([
        (MLLibVectors.sparse(4, [0, 2], [1, -1]), ),
        (MLLibVectors.dense([1, 2, 3, 4]), )
    ]).toDF(["features"])

    result = df.withColumn("features", as_ml("features"))

Result

    +--------------------+
    |            features|
    +--------------------+
    |(4,[0,2],[1.0,-1.0])|
    |   [1.0,2.0,3.0,4.0]|
    +--------------------+
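
The second option mentioned above, mapping over the RDD, would look roughly like this. It is only a sketch and assumes "features" is the only column you care about; any other columns would have to be carried through the tuple as well:

    # Sketch of the RDD alternative: call asML() on each mllib vector
    # and rebuild the DataFrame. Assumes "features" is the only column.
    result_rdd = df.rdd.map(
        lambda row: (row.features.asML(),)
    ).toDF(["features"])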
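
Spark 2.0 also ships a helper for exactly this kind of conversion, MLUtils.convertVectorColumnsToML (with convertVectorColumnsFromML for the opposite direction):

    from pyspark.mllib.util import MLUtils

    # Converts the named mllib vector column(s) to ml vectors;
    # if no column names are given, every vector column is converted.
    result = MLUtils.convertVectorColumnsToML(df, "features")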
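
Either way, once the column holds ml vectors, the LinearRegression call from the question should go through. A minimal sketch, assuming the original data DataFrame also contains a numeric label column (not shown in the question):

    from pyspark.ml.regression import LinearRegression

    # Assumes `data` has an mllib "features" column plus a "label" column;
    # the label column is an assumption here.
    converted = data.withColumn("features", as_ml("features"))

    lr = LinearRegression()
    modelA = lr.fit(converted, {lr.regParam: 0.0})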
