You are right that VectorAssembler chooses a dense or sparse output format based on whichever uses less memory.
You do not need a UDF to convert from SparseVector to DenseVector; just use the toArray() method:
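from pyspark.ml.linalg import SparseVector, DenseVector
# Sparse vector of size 4 with non-zero entries at indices 1 and 3
a = SparseVector(4, [1, 3], [3.0, 4.0])
# toArray() returns a NumPy array, which DenseVector accepts directly
b = DenseVector(a.toArray())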
Also, StandardScaler accepts SparseVector input as long as you did not set withMean=True at creation. If you do need to center the data, the mean (presumably non-zero) has to be subtracted from every component, so the sparse vector will no longer be sparse.
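To see why centering destroys sparsity, here is a minimal plain-Python sketch (no Spark involved). The rows and the helper names `center_columns` and `count_nonzero` are made up for illustration; this only mimics the mean subtraction and is not how StandardScaler is implemented:

```python
# Hypothetical toy data: three rows, mostly zeros (4 non-zero entries out of 12).
rows = [
    [0.0, 3.0, 0.0, 4.0],
    [0.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]


def count_nonzero(data):
    """Count entries that are not exactly zero."""
    return sum(1 for row in data for v in row if v != 0.0)


def center_columns(data):
    """Subtract each column's mean from every entry in that column."""
    n_rows = len(data)
    n_cols = len(data[0])
    means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in data]


centered = center_columns(rows)
print(count_nonzero(rows))      # non-zero entries before centering
print(count_nonzero(centered))  # non-zero entries after centering
```

Every former zero becomes minus the column mean, so a sparse representation of the centered data would save nothing.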