Spark >= 1.5.0
Although PySpark 1.5 provides distributed data structures ( pyspark.mllib.linalg.distributed ), their API is quite limited and there is no implementation of the computePrincipalComponents method.
You can use either pyspark.ml.feature.PCA or pyspark.mllib.feature.PCA. In the first case, the expected input is a data frame with a vector column:
```python
from pyspark.ml.feature import PCA as PCAml
from pyspark.ml.linalg import Vectors
```
In Spark 2.0 or later, use pyspark.ml.linalg.Vectors; in earlier versions use pyspark.mllib.linalg.Vectors instead.
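A minimal end-to-end sketch for the ml variant, assuming an active SparkSession named spark; the column names "features" and "pca_features" are arbitrary choices:

```python
# Build a data frame with a single vector column
df = spark.createDataFrame([
    (Vectors.dense([1, 2, 0]),),
    (Vectors.dense([2, 0, 1]),),
    (Vectors.dense([0, 1, 0]),)], ["features"])

# Fit a PCA model with k=2 components and project the data
pca = PCAml(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
transformed = model.transform(df)
```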
For the mllib version you will need an RDD[Vector]:
```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import PCA as PCAmllib

rdd = sc.parallelize([
    Vectors.dense([1, 2, 0]),
    Vectors.dense([2, 0, 1]),
    Vectors.dense([0, 1, 0])])

model = PCAmllib(2).fit(rdd)
transformed = model.transform(rdd)
```
Spark < 1.5.0
PySpark <= 1.4.1 does not support distributed data structures yet, so there is no built-in method for computing PCA. If the input matrix is relatively thin, you can compute the covariance matrix in a distributed way, collect the results, and perform the eigendecomposition locally on the driver.
The order of operations is more or less as below; distributed steps are followed by the name of the operation, local ones by "*" and an optional method (a code sketch follows the list):
1. Create an RDD[Vector] where each element is a single row of the input matrix. You can use a numpy.ndarray for each row ( parallelize )
2. Compute column-wise statistics ( reduce )
3. Use the results from 2. to center the matrix ( map )
4. Compute the outer product for each row ( map outer )
5. Sum the results to obtain the covariance matrix ( reduce + )
6. Collect and compute the eigendecomposition * ( numpy.linalg.eigh )
7. Choose the top-n eigenvectors *
8. Project the data ( map )
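Below is a minimal sketch of those steps in plain NumPy over an RDD, assuming an existing SparkContext named sc and a tiny example matrix; all variable names are illustrative:

```python
import numpy as np

# 1. One numpy.ndarray per row of the input matrix
rows_rdd = sc.parallelize([
    np.array([1.0, 2.0, 0.0]),
    np.array([2.0, 0.0, 1.0]),
    np.array([0.0, 1.0, 0.0])])

n = rows_rdd.count()

# 2. Column statistics: sum the rows and divide to get column means
mean = rows_rdd.reduce(lambda a, b: a + b) / n

# 3. Center the matrix
centered = rows_rdd.map(lambda row: row - mean)

# 4.-5. Outer products summed into the (sample) covariance matrix
cov = (centered
       .map(lambda row: np.outer(row, row))
       .reduce(lambda a, b: a + b)) / (n - 1)

# 6. Local eigendecomposition on the driver
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 7. Top-k eigenvectors (eigh returns eigenvalues in ascending order)
k = 2
components = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]

# 8. Project the centered data onto the principal components
projected = centered.map(lambda row: row.dot(components))
```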
Regarding scikit-learn: you can use NumPy (it is already used in mllib), SciPy, or scikit-learn locally on the driver or on a worker in the same way as usual.
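For instance, if the data is small enough to collect, a local scikit-learn PCA on the driver works as usual; a hedged sketch, assuming sc and scikit-learn are available:

```python
import numpy as np
from sklearn.decomposition import PCA

# Collect a small RDD to the driver and fit locally.
# This only makes sense when the data fits in driver memory.
local_rows = np.array(sc.parallelize([
    [1.0, 2.0, 0.0],
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0]]).collect())

projected_local = PCA(n_components=2).fit_transform(local_rows)
```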