Retrieving the original data after running PCA with Spark

Question

I use PCA to reduce an m×n matrix to an m×2 matrix.

I am using the following snippet from the Apache Spark site in my project, and it works.

 import org.apache.spark.mllib.linalg.Matrix
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 val mat: RowMatrix = ...
 // Compute the top 2 principal components.
 val pc: Matrix = mat.computePrincipalComponents(2) // Principal components are stored in a local dense matrix.
 // Project the rows to the linear space spanned by the top 2 principal components.
 val projected: RowMatrix = mat.multiply(pc)

I could not find anything in the API for getting back to the original data, that is, for finding out which columns PCA selected as the principal components.

Is there any library function that does this?

UPDATE

If the PCA algorithm selected and transformed two columns of my data, how can I check which columns of the original data this transformation applies to?

Example

Multidimensional matrix:

 0 0 0 2 4
 2 4 9 1 3
 3 9 3 2 7
 9 6 0 7 7

After running PCA to reduce it to 2 dimensions, I get this:

 -1.4  3
  2   -4.0
  3   -2.9
 -0.9  6

Given that, how can I tell which columns of the source data PCA selected as principal components for the reduction?

Thanks in advance.

algorithm scala pca apache-spark

OiRc Aug 05 '15 at 6:46

1 answer

Till rohrmann · Accepted Answer · 2015-08-05T07:40:25+0000

The pc matrix contains the principal components as its columns. According to the docs:

Rows correspond to observations and columns correspond to variables. The principal components are stored in a local n-by-k matrix. Each column corresponds to one principal component, and the columns are in descending order of component variance.

So you can look at the ith column by doing

 val pc: Matrix = ...
 val i: Int = ...
 for (row <- 0 until pc.numRows) {
   println(pc(row, i))
 }

Update

If you have mat as your input matrix

 0 0 0 2 4
 2 4 9 1 3
 3 9 3 2 7
 9 6 0 7 7

where each row is one example and each column is a variable, then you can compute the PCA. The two principal components with the highest variance are pc =

  0.6072   0.2049
  0.3466   0.6626
 -0.4674   0.7098
  0.4343  -0.1024
  0.3225   0.0689

Each column is a projection direction that yields one dimension of the reduced data. To get the reduced data, you compute mat * pc, which gives you

  2.1588   0.0706
 -0.2041   9.5523
  6.6652   8.9843
 12.8425   5.5844

This is what your data looks like when projected into the lower-dimensional vector space. Again, each row represents an example and each column is a variable.
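To make the mat * pc step concrete, here is a minimal sketch that redoes the multiplication with plain Scala arrays instead of a RowMatrix, so it runs without a Spark cluster. The object name `ProjectByHand` and the helper `multiply` are made up for this illustration; the numbers are the ones from the example above.

```scala
object ProjectByHand {
  // Input data from the example: 4 observations (rows) x 5 variables (columns).
  val mat: Array[Array[Double]] = Array(
    Array(0.0, 0.0, 0.0, 2.0, 4.0),
    Array(2.0, 4.0, 9.0, 1.0, 3.0),
    Array(3.0, 9.0, 3.0, 2.0, 7.0),
    Array(9.0, 6.0, 0.0, 7.0, 7.0)
  )

  // Principal components from the example: 5 variables (rows) x 2 components (columns).
  val pc: Array[Array[Double]] = Array(
    Array( 0.6072,  0.2049),
    Array( 0.3466,  0.6626),
    Array(-0.4674,  0.7098),
    Array( 0.4343, -0.1024),
    Array( 0.3225,  0.0689)
  )

  // Plain matrix product: entry (r, c) is the dot product of
  // row r of `a` with column c of `b`.
  def multiply(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] =
    a.map { row =>
      b.head.indices.map { c =>
        row.indices.map(k => row(k) * b(k)(c)).sum
      }.toArray
    }

  // 4 x 2 result: one row per observation, one column per component.
  val projected: Array[Array[Double]] = multiply(mat, pc)

  def main(args: Array[String]): Unit =
    projected.foreach(row => println(row.map(v => f"$v%.4f").mkString("  ")))
}
```

The printed rows should agree with the projected matrix above, up to the rounding of the four-digit pc entries.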

If I understand your question correctly, then you are looking for the columns of the pc matrix, which indicate how much each source dimension affects the projected dimensions. The projection is just the scalar product of the source data with the projection directions (the columns of pc).
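Note that PCA does not pick out individual source columns: every component is a weighted combination of all of them. What you can do is rank the source columns by the absolute size of their weight in a given component. A rough sketch of that idea follows, again with plain arrays; `RankLoadings` and `rankedLoadings` are hypothetical names, not part of the Spark API.

```scala
object RankLoadings {
  // Principal components from the example: 5 variables (rows) x 2 components (columns).
  val pc: Array[Array[Double]] = Array(
    Array( 0.6072,  0.2049),
    Array( 0.3466,  0.6626),
    Array(-0.4674,  0.7098),
    Array( 0.4343, -0.1024),
    Array( 0.3225,  0.0689)
  )

  // Hypothetical helper: for component `j`, pair each source column index with its
  // weight in that component and sort by absolute weight, largest first.
  def rankedLoadings(pc: Array[Array[Double]], j: Int): Seq[(Int, Double)] =
    pc.indices
      .map(col => (col, pc(col)(j)))
      .sortBy { case (_, w) => -math.abs(w) }

  def main(args: Array[String]): Unit =
    for (j <- pc.head.indices) {
      println(s"component $j:")
      rankedLoadings(pc, j).foreach { case (col, w) =>
        println(f"  source column $col: weight $w%.4f")
      }
    }
}
```

With Spark's Matrix the same weights can be read as pc(col, j), as in the loop shown earlier. A large absolute weight only means the column contributes strongly to that direction; it does not mean the other columns were dropped.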
