Computing correlation with PySpark

I want to use the pyspark.mllib.stat.Statistics.corr function to calculate the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an RDD of Vectors as input. How do I convert a column like df['some_name'] into an RDD of Vectors?

2 answers

This should not be necessary. For numerical columns you can compute the correlation directly using DataFrameStatFunctions.corr:

    df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
    df1.stat.corr("x", "y")
    # -1.0

Otherwise you can use VectorAssembler:

    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    # Note: on Spark 2.x+ DataFrame no longer has flatMap; use .rdd.flatMap instead.
    assembler.transform(df).select("features").flatMap(lambda x: x)
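If the goal is still to call pyspark.mllib.stat.Statistics.corr on the assembled vectors, here is a minimal sketch (assuming df contains only numerical columns, and Spark 2.x, where VectorAssembler emits ml Vectors that have to be converted to mllib Vectors first):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    # Assemble all columns into one vector per row, then convert each
    # ml Vector to an mllib Vector so Statistics.corr accepts it.
    features = assembler.transform(df).select("features").rdd \
        .map(lambda row: Vectors.fromML(row.features))
    corr_matrix = Statistics.corr(features, method="pearson")  # numpy array

Unlike df.stat.corr, this returns the full pairwise correlation matrix for all columns rather than a single coefficient.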

OK, I figured it out:

    from pyspark.mllib.linalg import Vectors

    v1 = df.flatMap(lambda x: Vectors.dense(x[col_idx_1]))
    v2 = df.flatMap(lambda x: Vectors.dense(x[col_idx_2]))
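These two RDDs of doubles can then be passed straight to Statistics.corr. A minimal sketch (assuming col_idx_1 and col_idx_2 are the positional indices of the two columns of interest):

    from pyspark.mllib.stat import Statistics

    # Correlation between the two columns, returned as a single float.
    corr = Statistics.corr(v1, v2, method="pearson")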
