Spark >= 1.5.0
Although PySpark 1.5 provides distributed data structures ( pyspark.mllib.linalg.distributed ), their API is quite limited and there is no implementation of the computePrincipalComponents method.
You can use either pyspark.ml.feature.PCA or pyspark.mllib.feature.PCA. In the first case, the expected input is a data frame with a vector column:
```python
from pyspark.ml.feature import PCA as PCAml
from pyspark.ml.linalg import Vectors
```
In Spark 2.0 or later, use pyspark.ml.linalg.Vectors; in earlier versions use pyspark.mllib.linalg.Vectors instead.
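A minimal end-to-end sketch for the ml variant, assuming an active SparkSession named spark; the column names "features" and "pca_features" are arbitrary choices:

```python
# Build a data frame with a single vector column
df = spark.createDataFrame([
    (Vectors.dense([1, 2, 0]),),
    (Vectors.dense([2, 0, 1]),),
    (Vectors.dense([0, 1, 0]),)], ["features"])

# Fit a PCA model with k=2 components and project the data
pca = PCAml(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
transformed = model.transform(df)
```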
For the mllib version you will need an RDD[Vector]:
```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import PCA as PCAmllib

rdd = sc.parallelize([
    Vectors.dense([1, 2, 0]),
    Vectors.dense([2, 0, 1]),
    Vectors.dense([0, 1, 0])])

model = PCAmllib(2).fit(rdd)
transformed = model.transform(rdd)
```
Spark < 1.5.0
PySpark <= 1.4.1 does not support distributed data structures yet, so there is no built-in method for computing PCA. If the input matrix is relatively thin, you can compute the covariance matrix in a distributed way, collect the results, and perform the eigendecomposition locally on the driver.
The order of operations is more or less as below; distributed steps are followed by the name of the operation, local ones by "*" and an optional method (a code sketch follows the list):
1. Create an RDD[Vector] where each element is a single row of the input matrix. You can use a numpy.ndarray for each row ( parallelize )
2. Compute column-wise statistics ( reduce )
3. Use the results from 2. to center the matrix ( map )
4. Compute the outer product for each row ( map outer )
5. Sum the results to obtain the covariance matrix ( reduce + )
6. Collect and compute the eigendecomposition * ( numpy.linalg.eigh )
7. Choose the top-n eigenvectors *
8. Project the data ( map )
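Below is a minimal sketch of those steps in plain NumPy over an RDD, assuming an existing SparkContext named sc and a tiny example matrix; all variable names are illustrative:

```python
import numpy as np

# 1. One numpy.ndarray per row of the input matrix
rows_rdd = sc.parallelize([
    np.array([1.0, 2.0, 0.0]),
    np.array([2.0, 0.0, 1.0]),
    np.array([0.0, 1.0, 0.0])])

n = rows_rdd.count()

# 2. Column statistics: sum the rows and divide to get column means
mean = rows_rdd.reduce(lambda a, b: a + b) / n

# 3. Center the matrix
centered = rows_rdd.map(lambda row: row - mean)

# 4.-5. Outer products summed into the (sample) covariance matrix
cov = (centered
       .map(lambda row: np.outer(row, row))
       .reduce(lambda a, b: a + b)) / (n - 1)

# 6. Local eigendecomposition on the driver
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 7. Top-k eigenvectors (eigh returns eigenvalues in ascending order)
k = 2
components = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]

# 8. Project the centered data onto the principal components
projected = centered.map(lambda row: row.dot(components))
```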
Regarding scikit-learn: you can use NumPy (it is already used in mllib), SciPy, or scikit-learn locally on the driver or on a worker in the same way as usual.
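For instance, if the data is small enough to collect, a local scikit-learn PCA on the driver works as usual; a hedged sketch, assuming sc and scikit-learn are available:

```python
import numpy as np
from sklearn.decomposition import PCA

# Collect a small RDD to the driver and fit locally.
# This only makes sense when the data fits in driver memory.
local_rows = np.array(sc.parallelize([
    [1.0, 2.0, 0.0],
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0]]).collect())

projected_local = PCA(n_components=2).fit_transform(local_rows)
```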