Get VectorAssembler to output DenseVector only?

Something about VectorAssembler is very annoying. I am currently converting a set of columns into a single vector column, and then using StandardScaler to scale the resulting features. However, Spark decides, for memory reasons, whether to represent each row of features as a DenseVector or a SparseVector. But when you then use StandardScaler, SparseVector input is invalid and only DenseVectors are accepted. Does anyone know a way around this?
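For context, a minimal sketch of the kind of pipeline this describes (the DataFrame df and the column names x1, x2, x3 are placeholders, not the real schema):

from pyspark.ml.feature import VectorAssembler, StandardScaler

# placeholder column names, not the actual data
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
assembled = assembler.transform(df)  # each row may come out sparse or dense

# withMean=True centers the features, which is where the sparse rows get rejected
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)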

Edit: I decided to use a UDF instead, which turns the sparse vector into a dense vector. It looks silly, but it works.
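A rough sketch of that UDF workaround (the column name "features" is an assumption):

from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

# wrap the sparse-to-dense conversion in a UDF and overwrite the vector column
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
df = df.withColumn("features", to_dense("features"))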

+4
2 answers

You are right that VectorAssembler chooses a dense or sparse output format based on whichever uses less memory.

You do not need a UDF to convert from SparseVector to DenseVector; just use the toArray() method:

from pyspark.ml.linalg import SparseVector, DenseVector

# SparseVector(size, indices, values); toArray() fills in the zeros
a = SparseVector(4, [1, 3], [3.0, 4.0])
b = DenseVector(a.toArray())
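Here b is DenseVector([0.0, 3.0, 0.0, 4.0]), the same values with the zero entries stored explicitly.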

Also, StandardScaler accepts SparseVector input as long as you did not set withMean=True when creating it. If you need to center the data, you have to subtract a (presumably non-zero) mean from every component, so the vector would no longer be sparse anyway.
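For example, a sketch with the default withMean=False (the column names and the assembled DataFrame are assumptions, not from the question):

from pyspark.ml.feature import StandardScaler

# zeros stay zero when only dividing by the standard deviation,
# so sparse rows can be scaled as-is
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=False, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)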

+2

You can convert the column to dense vectors after VectorAssembler has produced sparse ones.

Here is what I did:

Create a case class that wraps the dense Vector:

import org.apache.spark.mllib.linalg.{SparseVector, Vector}
case class vct(features: Vector)

// index 0: only the sparse vector column is selected; toDense fills in the zeros
val new_df = df.select("sparse vector column")
  .map(x => vct(x.getAs[SparseVector](0).toDense)).toDF()
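Note that this uses the older org.apache.spark.mllib.linalg vector types; with the DataFrame-based spark.ml API (as in the first answer), the corresponding classes live in org.apache.spark.ml.linalg.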

0
