This should not be necessary. For a plain numerical calculation, you can compute the correlation directly with DataFrameStatFunctions.corr:
    df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
    df1.stat.corr("x", "y")
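In the toy DataFrame above, x and y are perfectly anti-correlated, so `corr` returns -1.0. You can sanity-check that value locally with NumPy (a plain-Python illustration, not Spark code):

```python
import numpy as np

# Same two rows as the toy DataFrame: x = [0.0, 1.0], y = [1.0, 0.0]
x = [0.0, 1.0]
y = [1.0, 0.0]

# Pearson correlation, the same statistic df.stat.corr computes by default
r = np.corrcoef(x, y)[0, 1]
print(r)  # -1.0: the columns are perfectly anti-correlated
```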
Otherwise, if you need the columns assembled into a single vector column (e.g. for a correlation matrix), you can use VectorAssembler:
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    # flatMap is an RDD method (DataFrame.flatMap was removed in Spark 2.0),
    # so go through .rdd
    assembler.transform(df).select("features").rdd.flatMap(lambda x: x)
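The assembled vector column is what a pairwise correlation-matrix computation consumes: entry (i, j) of the result is the correlation between columns i and j. The same matrix can be sketched locally with NumPy's corrcoef (the three columns and their values here are made up for illustration):

```python
import numpy as np

# Hypothetical data with three columns; each row is one record
data = np.array([
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 0.0, 0.0],
])

# rowvar=False treats each column as one variable;
# result[i, j] is the Pearson correlation of columns i and j
corr_matrix = np.corrcoef(data, rowvar=False)
print(corr_matrix)
```

The diagonal is always 1.0 (every column is perfectly correlated with itself), which is a quick way to verify the orientation of the matrix.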