In PySpark, you will need either a `udf` or a map over the underlying RDD to convert the old `mllib` vectors to the new `ml` vectors. Let me use the first option. First, the imports:
    from pyspark.ml.linalg import VectorUDT
    from pyspark.sql.functions import udf
and the conversion function:
    as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
With sample data:
    from pyspark.mllib.linalg import Vectors as MLLibVectors

    df = sc.parallelize([
        (MLLibVectors.sparse(4, [0, 2], [1, -1]), ),
        (MLLibVectors.dense([1, 2, 3, 4]), )
    ]).toDF(["features"])

    result = df.withColumn("features", as_ml("features"))
Result:
    +--------------------+
    |            features|
    +--------------------+
    |(4,[0,2],[1.0,-1.0])|
    |   [1.0,2.0,3.0,4.0]|
    +--------------------+
user6910411