Get VectorAssembler to output DenseVector only?

Something about VectorAssembler is very annoying. I am currently converting a set of columns into a single vector column, and then using StandardScaler to scale the resulting features. However, Spark decides, for memory reasons, whether to represent each row of features as a DenseVector or a SparseVector. But when you then use StandardScaler, SparseVector input is invalid and only DenseVectors are accepted. Does anyone know a way around this?
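For context, a minimal sketch of the kind of pipeline this describes (the DataFrame df and the column names x1, x2, x3 are placeholders, not the real schema):

from pyspark.ml.feature import VectorAssembler, StandardScaler

# placeholder column names, not the actual data
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
assembled = assembler.transform(df)  # each row may come out sparse or dense

# withMean=True centers the features, which is where the sparse rows get rejected
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)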

Edit: I decided to use a UDF instead, which turns the sparse vector into a dense vector. It looks silly, but it works.
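A rough sketch of that UDF workaround (the column name "features" is an assumption):

from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

# wrap the sparse-to-dense conversion in a UDF and overwrite the vector column
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
df = df.withColumn("features", to_dense("features"))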

+4
2 answers

You are right that VectorAssembler chooses a dense or sparse output format based on whichever uses less memory.

You do not need a UDF to convert from SparseVector to DenseVector; just use the toArray() method:

from pyspark.ml.linalg import SparseVector, DenseVector

# SparseVector(size, indices, values); toArray() fills in the zeros
a = SparseVector(4, [1, 3], [3.0, 4.0])
b = DenseVector(a.toArray())
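Here b is DenseVector([0.0, 3.0, 0.0, 4.0]), the same values with the zero entries stored explicitly.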

Also, StandardScaler accepts SparseVector input as long as you did not set withMean=True when creating it. If you need to center the data, you have to subtract a (presumably non-zero) mean from every component, so the vector would no longer be sparse anyway.
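For example, a sketch with the default withMean=False (the column names and the assembled DataFrame are assumptions, not from the question):

from pyspark.ml.feature import StandardScaler

# zeros stay zero when only dividing by the standard deviation,
# so sparse rows can be scaled as-is
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=False, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)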

+2

You can convert the column to dense vectors after VectorAssembler has produced sparse ones.

Here is what I did:

Create a case class that wraps the dense Vector:

import org.apache.spark.mllib.linalg.{SparseVector, Vector}
case class vct(features: Vector)

// index 0: only the sparse vector column is selected; toDense fills in the zeros
val new_df = df.select("sparse vector column")
  .map(x => vct(x.getAs[SparseVector](0).toDense)).toDF()
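Note that this uses the older org.apache.spark.mllib.linalg vector types; with the DataFrame-based spark.ml API (as in the first answer), the corresponding classes live in org.apache.spark.ml.linalg.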

0
