Scikit-learn TruncatedSVD explained variance ratio not in descending order

The explained variance ratio from TruncatedSVD is not in descending order, unlike sklearn's PCA. I looked at the source code, and it seems that they use different ways of calculating the explained variance ratio:

TruncatedSVD:

    U, Sigma, VT = randomized_svd(X, self.n_components,
                                  n_iter=self.n_iter,
                                  random_state=random_state)
    X_transformed = np.dot(U, np.diag(Sigma))
    self.explained_variance_ = exp_var = np.var(X_transformed, axis=0)
    if sp.issparse(X):
        _, full_var = mean_variance_axis(X, axis=0)
        full_var = full_var.sum()
    else:
        full_var = np.var(X, axis=0).sum()
    self.explained_variance_ratio_ = exp_var / full_var

PCA:

    U, S, V = linalg.svd(X, full_matrices=False)
    explained_variance_ = (S ** 2) / n_samples
    explained_variance_ratio_ = (explained_variance_ /
                                 explained_variance_.sum())

PCA uses the singular values directly to compute the explained variance, and since the singular values are in descending order, the explained variance is too. TruncatedSVD, on the other hand, computes explained_variance_ from the variance of the columns of the transformed matrix, and those variances are not necessarily in descending order.
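As a minimal sketch of this difference (my own toy data, not anything from the original question), one can fit TruncatedSVD on uncentered random data and observe that the singular values are always descending, while the explained variance ratios derived from column variances need not be:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
# Uncentered data: the leading singular vector largely captures the mean
# offset, which carries little *variance* for the first component.
X = rng.rand(100, 20) + 5.0

svd = TruncatedSVD(n_components=5, random_state=0).fit(X)
ratios = svd.explained_variance_ratio_

# Singular values are always in descending order...
assert np.all(np.diff(svd.singular_values_) <= 1e-9)

# ...but the explained variance ratios may not be; inspect and,
# if needed, recover a descending order explicitly.
print(ratios)
print(np.argsort(ratios)[::-1])  # component indices by explained variance
```

On data like this, the first ratio can come out smaller than the second, which is exactly the behavior described below.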

Does this mean that I need to sort explained_variance_ratio_ from TruncatedSVD first to find the top k principal components?

2 answers

You do not need to sort explained_variance_ratio_ ; the output itself is sorted and contains only n_components values.
From the documentation:

TruncatedSVD implements a variant of singular value decomposition (SVD) that only computes the k largest singular values, where k is a user-specified parameter.

X_transformed contains the decomposition using only the k components.

This example will give you an idea:

    >>> from sklearn.decomposition import TruncatedSVD
    >>> from sklearn.random_projection import sparse_random_matrix
    >>> X = sparse_random_matrix(100, 100, density=0.01, random_state=42)
    >>> svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
    >>> svd.fit(X)
    TruncatedSVD(algorithm='randomized', n_components=5, n_iter=7,
                 random_state=42, tol=0.0)
    >>> print(svd.explained_variance_ratio_)
    [0.0606... 0.0584... 0.0497... 0.0434... 0.0372...]
    >>> print(svd.explained_variance_ratio_.sum())
    0.249...
    >>> print(svd.singular_values_)
    [2.5841... 2.5245... 2.3201... 2.1753... 2.0443...]

Sorry to answer with another question, but I have exactly the same issue and cannot find a satisfactory explanation. Why is explained_variance_ratio_ from TruncatedSVD not in descending order, as it would be from PCA? In my experience, the first element of the list is always the lowest; the value then jumps up at the second element and descends from there: explained_variance_ratio_[0] < explained_variance_ratio_[1] > explained_variance_ratio_[2] > explained_variance_ratio_[3] ... Does this mean that the second "component" actually explains the greatest variance, not the first?
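If a descending order is needed regardless of why this happens, a simple workaround (a sketch, using np.argsort on a toy uncentered matrix of my own, not anything from the question) is to reindex the ratios and components explicitly:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(42)
X = rng.rand(200, 30) + 3.0  # uncentered data, as in the scenario above

svd = TruncatedSVD(n_components=4, random_state=42).fit(X)

# Indices of components sorted by descending explained variance.
order = np.argsort(svd.explained_variance_ratio_)[::-1]

sorted_ratios = svd.explained_variance_ratio_[order]
sorted_components = svd.components_[order]

# After reindexing, the ratios are guaranteed to be descending.
assert np.all(np.diff(sorted_ratios) <= 0)
```

Note that this reorders by variance explained after projection; the singular values themselves (and hence the reconstruction quality of the truncated SVD) are unaffected by how you index the components.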

