Computing correlation with PySpark

I want to use the pyspark.mllib.stat.Statistics.corr function to calculate the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an RDD of Vectors as input. How do I convert a column like df['some_name'] into an RDD of Vectors?

2 answers

This should not be necessary. For numerical columns you can compute the correlation directly using DataFrameStatFunctions.corr:

    df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
    df1.stat.corr("x", "y")
    # -1.0

Otherwise you can use VectorAssembler:

    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    # Note: on Spark 2.x+ DataFrame no longer has flatMap; use .rdd.flatMap instead.
    assembler.transform(df).select("features").flatMap(lambda x: x)
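If the goal is still to call pyspark.mllib.stat.Statistics.corr on the assembled vectors, here is a minimal sketch (assuming df contains only numerical columns, and Spark 2.x, where VectorAssembler emits ml Vectors that have to be converted to mllib Vectors first):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
    # Assemble all columns into one vector per row, then convert each
    # ml Vector to an mllib Vector so Statistics.corr accepts it.
    features = assembler.transform(df).select("features").rdd \
        .map(lambda row: Vectors.fromML(row.features))
    corr_matrix = Statistics.corr(features, method="pearson")  # numpy array

Unlike df.stat.corr, this returns the full pairwise correlation matrix for all columns rather than a single coefficient.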

OK, I figured it out:

    from pyspark.mllib.linalg import Vectors

    v1 = df.flatMap(lambda x: Vectors.dense(x[col_idx_1]))
    v2 = df.flatMap(lambda x: Vectors.dense(x[col_idx_2]))
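These two RDDs of doubles can then be passed straight to Statistics.corr. A minimal sketch (assuming col_idx_1 and col_idx_2 are the positional indices of the two columns of interest):

    from pyspark.mllib.stat import Statistics

    # Correlation between the two columns, returned as a single float.
    corr = Statistics.corr(v1, v2, method="pearson")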
