EDIT: I have completely rewritten the answer now that I understand which assumptions were wrong.
Before explaining what doesn't work in the OP, let me make sure we're using the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easier to describe the data, i.e. the various multidimensional observations, in a lower-dimensional space. Observations are multidimensional when they consist of several measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in 3D space can be described by a 2D plane.
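As a minimal sketch of that last claim (assuming MATLAB's princomp, whose third output holds the eigenvalues of the covariance matrix; newer releases provide pca instead):

    % Two linearly independent observations (rows) in 3D: after centering,
    % all the variance lies along a single direction, so at least one
    % eigenvalue must be zero.
    y = [1 0 0
         0 1 0];
    [coeff, score, ev] = princomp(y);
    disp(ev)    % only the first eigenvalue is nonzero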
If we have an array
    x = [ 1  3  4
          2  4 -1
          4  6  9
          3  5 -2];
consisting of four observations with three measurements each, princomp(x) will find the lower-dimensional space spanned by the four observations. Since two of the measurements are co-dependent, one of the eigenvalues will be near zero, since the space of measurements is only 2D and not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are quite obviously collinear:
    coeff = princomp(x)
    coeff =
          0.10124      0.69982      0.70711
          0.10124      0.69982     -0.70711
           0.9897     -0.14317   1.1102e-16
Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].
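To see the corresponding eigenvalues for this example, a short sketch (the original answer does not list these numbers, so the comment only indicates the expected pattern):

    [coeff, score, latent] = princomp(x);
    disp(latent)    % the third eigenvalue should be numerically zero,
                    % reflecting the 2D measurement space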
Now, if we want to find out whether any measurements are linearly dependent, and if we really want to use principal components for this, for example because in real life measurements may not be perfectly collinear and we are interested in finding good descriptor vectors for machine learning applications, it makes more sense to treat the three measurements as 'observations' and run princomp(x'). Since there are then only three 'observations' but four 'dimensions', the fourth eigenvector will be zero. However, since there are two linearly dependent observations, we are left with only two nonzero eigenvalues:
    eigenvalues =
           24.263
           3.7368
                0
                0
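For reference, a sketch of the call that should produce this output, assuming the x defined above:

    % Transpose x so that the measurements become the 'observations';
    % princomp's third output holds the eigenvalues.
    [coeff, score, eigenvalues] = princomp(x');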
To find out which of the measurements are so strongly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for, e.g., machine learning), the best way is to look at the correlation between the measurements:
    corr(x)
    ans =
                1            1      0.35675
                1            1      0.35675
          0.35675      0.35675            1
Not surprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
EDIT2
> but the eigenvalues tell us which vectors in the new space are the most important (cover the most variation), and the coefficients tell us how much of each variable is in each component. So I assume we can use this data to find out which of the original variables have the most variance and are therefore the most important (and to get rid of those that represent only a small amount)
This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3; 1 4 22; 1 25 -25; 1 11 100];, and thus the first variable contributes nothing to the variance). With collinear measurements, however, both vectors carry equivalent information and contribute equally to the variance, so the eigenvectors (coefficients) are likely to be similar to one another.
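A quick sketch of that low-variance case (the variable name x2 is mine, not from the original):

    % Per-column variance: the first entry is 0, so the first variable
    % contributes nothing and could be dropped.
    x2 = [1 2 3; 1 4 22; 1 25 -25; 1 11 100];
    disp(var(x2))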
To keep @agnieszka's comments meaningful, I have left the original points 1-4 of my answer below. Note that #3 was in response to dividing the eigenvectors by the eigenvalues, which did not make much sense to me.
1. The vectors should be in rows, not in columns (each vector is an observation).
2. coeff returns the basis vectors of the principal components, and its order has little to do with the original input.
3. To see the importance of the principal components, you use eigenvalues/sum(eigenvalues).
4. If you have two collinear vectors, you can't say that the first one is important and the second one is not. How do you know it shouldn't be the other way around? If you want to test for collinearity, you should instead check the rank of the array, or call unique on normalized (i.e. norm equal to 1) vectors; see the sketch after this list.
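A hedged sketch of the checks from point 4, using the x from above. Note that in that x, v1 and v2 differ by a constant offset rather than a scale factor, so the columns must be centered first for the rank/uniqueness checks to reveal the dependence; the centering step is my addition, not part of the original points:

    % Center the columns, then check the rank: fewer independent
    % directions than columns indicates linearly dependent measurements.
    xc = bsxfun(@minus, x, mean(x));
    rank(xc)    % 2 here, although x has 3 columns

    % Normalize the centered columns to unit length; collinear
    % measurements collapse onto the same unit vector (up to sign),
    % so unique keeps only one copy.
    xn = bsxfun(@rdivide, xc, sqrt(sum(xc.^2, 1)));
    unique(xn', 'rows')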