PCA in MATLAB that selects the top N components

I want to select the top N = 10,000 principal components of a matrix. After pca completes, I expected MATLAB to return the p×p coefficient matrix, but it does not!

    >> size(train_data)
    ans = 400 153600
    >> [coefs, scores, variances] = pca(train_data);
    >> size(coefs)
    ans = 153600 399
    >> size(scores)
    ans = 400 399
    >> size(variances)
    ans = 399 1

Shouldn't coefs be 153600 × 153600 and scores be 400 × 153600?

When I use the code below instead, I get an Out of Memory error:

    >> [V, D] = eig(cov(train_data));
    Out of memory. Type HELP MEMORY for your options.
    Error in cov (line 96)
    xy = (xc' * xc) / (m-1);

I do not understand why MATLAB returns the smaller matrices. If pca really computed the full p×p output, it should hit the same error: 153600 × 153600 × 8 bytes ≈ 188 GB.
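Just to double-check that arithmetic, here is the same back-of-the-envelope calculation as plain MATLAB (nothing beyond base functions):

    p = 153600;                                % number of features
    bytesNeeded = p^2 * 8;                     % a full p-by-p matrix of 8-byte doubles
    fprintf('%.1f GB\n', bytesNeeded / 1e9)    % prints 188.7 GB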

The same error occurs with eigs:

    >> eigs(cov(train_data));
    Out of memory. Type HELP MEMORY for your options.
    Error in cov (line 96)
    xy = (xc' * xc) / (m-1);
3 answers

Foreword

I think you are a victim of the XY problem: trying to find 153,600 dimensions in your data is completely unphysical. Ask about your actual problem (X), not your proposed solution (Y), to get a meaningful answer. I will use this post only to explain why PCA is not suitable here. I cannot tell you what will solve your problem, because you have not told us what it is.

What you are asking for is mathematically unsound, as I will try to explain here.

PCA

PCA, as user3149915 said, is a way to reduce dimensionality. That means that somewhere in your problem you have one hundred fifty-three thousand six hundred dimensions floating around. That is a lot. A mind-boggling amount. Explaining the physical reason why all of them exist may be a bigger problem than the mathematical one you are trying to solve.

Trying to fit that many dimensions to only 400 observations will not work, because even if all the observations were linearly independent vectors in your feature space, you could still extract only 399 dimensions; the rest simply cannot be found, since there are no observations left to define them. You can fit at most N−1 unique dimensions through N points; the remaining dimensions have an infinite number of possible placements. It is like trying to fit a plane through two points: you can fit a line through them, and the third dimension will be perpendicular to that line but undefined in its rotation. You are therefore left with an infinite number of possible planes that all fit those two points.
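You can see the rank argument concretely in MATLAB with a toy sketch (made-up sizes, not the poster's data): centering removes one degree of freedom, so n observations can never span more than n−1 directions, no matter how many columns there are:

    X = rand(5, 100);          % 5 observations in a 100-dimensional feature space
    Xc = X - mean(X, 1);       % pca centers the data like this internally (R2016b+ expansion)
    rank(Xc)                   % returns 4 = n - 1: only 4 directions can be estimated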

I do not even think you are fitting "noise" after the first 400 components; I think that after those you are fitting a void. You have used all of your data to obtain those dimensions and cannot create any more. Impossible. All you can do is gather more observations, about 1.5 million of them, and run PCA again.

More observations than dimensions

Why do you need more observations than dimensions, you may ask? Easy: you cannot fit a unique line through one point, nor a unique plane through two points, nor a unique 153,600-dimensional hyperplane through 400 points.

So, if I get 153,600 observations, am I set?

Unfortunately not. If you have two points and fit a line through them, you get a 100% fit. No error, yay! Done for the day, let's go home and watch TV! Unfortunately, your boss will call you the next morning because your fit is garbage. Why? Because if you had, say, 20 points scattered around, the fit would not be error-free, but it would at least be much closer to representing your actual data, since the first two points could be outliers. See this very illustrative figure, where the red dots would be your first two observations:

[Figure: a cloud of points with a fitted regression line; the two red outlier points, on their own, would define a completely different line than the full point cloud.]
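A hypothetical toy version of that figure in code (all numbers made up): a line through exactly two points fits perfectly but tells you nothing, while a line through twenty noisy points has residual error yet recovers the real trend:

    x2 = [1 2];  y2 = [5 1];            % two observations: the fit is exact...
    p2 = polyfit(x2, y2, 1);            % ...but the slope -4 may be pure accident

    rng(0);                             % for reproducibility
    x20 = linspace(0, 10, 20);
    y20 = 2 * x20 + 3 + randn(1, 20);   % 20 noisy samples of the true line y = 2x + 3
    p20 = polyfit(x20, y20, 1);         % nonzero residual, but slope close to 2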

If you were to extract the first 10,000 components, you would get 399 exact fits and 9,601 empty dimensions. You might as well not even try to compute anything beyond the 399th dimension and simply stick the results into a zero-padded array with 10,000 entries.

TL;DR: You cannot use PCA for this, and we cannot help you solve your problem until you tell us what that problem is.


PCA is a dimensionality-reduction algorithm: it reduces the feature set to principal components (PCs), each of which is a linear combination of the original features. All of this is done to shrink the feature space, i.e. to turn a large feature space into a more manageable one while retaining most, if not all, of the information.
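Concretely, using the outputs from the question (and noting that pca centers the data by default), each score column is exactly such a linear combination, so the scores can be reproduced by hand:

    Xc = train_data - mean(train_data, 1);   % pca centers the data by default
    reconstructed = Xc * coefs;              % each column: a linear combination of all features
    max(abs(scores(:) - reconstructed(:)))   % ~0, up to floating-point error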

Now, for your problem, you are trying to explain the variance among your 400 observations using 153,600 features, but we do not need that much information: 399 PCs will explain 100% of the variance in your sample (I would be very surprised if that were not the case). The reason is essentially overfitting: your algorithm finds noise that explains every observation in your sample.

So what rayryeng told you is correct: if you want to reduce the space to 10,000 PCs, you will need 100,000 observations for those PCs to mean anything (that is a rule of thumb, but a fairly stable one).

And the reason MATLAB gave you 399 PCs is that it could correctly extract only 399 linear combinations that explain some percentage of the variance in your sample.

If, on the other hand, what you want are the most important features, then you are not looking for dimensionality reduction but for feature elimination. Such methods keep only the most important features while zeroing out the irrelevant ones; see the sketch below.
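As a minimal sketch of one such heuristic, ranking features by variance (the cutoff of 1,000 is an arbitrary assumption, and supervised selection methods would usually do better):

    featVar = var(train_data, 0, 1);     % sample variance of each of the 153600 features
    [~, order] = sort(featVar, 'descend');
    keep = order(1:1000);                % keep the 1000 highest-variance features
    reducedData = train_data(:, keep);   % 400 x 1000 instead of 400 x 153600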

Just to illustrate: if your feature space is garbage with no information, only noise, each component will explain a tiny, roughly equal share of the variance, and the variance explained by any fixed number of PCs really will be less than 100%. For example, see the following:

    data = rand(400, 401);          % pure noise: 400 observations, 401 features
    [coefs, scores, variances] = pca(data);
    numel(variances)                % 399: at most n-1 components from n observations
    pctExplained = 100 * cumsum(variances) / sum(variances);
    disp(['Var explained by the first 10 PCs: ' num2str(pctExplained(10)) ' %'])

Again, if you want to reduce the feature space, there are ways to do it even with small m, but PCA is not one of them.

Good luck


By default, MATLAB computes the economy-size decomposition so as not to waste resources on components it cannot estimate. But you can still get the full p×p output; just use:

    [coefs, scores, variances] = pca(train_data, 'Economy', false);

Keep in mind, though, that the full coefficient matrix is then 153600 × 153600 doubles, the same ~188 GB allocation that already failed with eig.
