Is it possible to apply PCA to any text classification task?

I am trying to do text classification in Python. I use the Naive Bayes MultinomialNB classifier for web pages (I extract text data from web forms and then classify that text: web page classification).

Now I am trying to apply PCA to this data, but Python raises errors.

My classification code with Naive Bayes:

    from sklearn.decomposition import PCA, RandomizedPCA
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = CountVectorizer()
    classifer = MultinomialNB(alpha=.01)

    x_train = vectorizer.fit_transform(temizdata)
    classifer.fit(x_train, y_train)

This Naive Bayes classification gives the following result:

    >>> x_train
    <43x4429 sparse matrix of type '<class 'numpy.int64'>'
        with 6302 stored elements in Compressed Sparse Row format>
    >>> print(x_train)
      (0, 2966)    1
      (0, 1974)    1
      (0, 3296)    1
      ..
      ..
      (42, 1629)   1
      (42, 2833)   1
      (42, 876)    1

Then I try to apply PCA to my data (temizdata):

    >>> v_temizdata = vectorizer.fit_transform(temizdata)
    >>> pca_t = PCA.fit_transform(v_temizdata)
    >>> pca_t = PCA().fit_transform(v_temizdata)

but it raises the following error:

    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

The error says to convert the matrix to a dense matrix or numpy array. So I tried creating a dense matrix, but then I get another error.

My main goal is to test how PCA affects the classification of the text.

Converting to a dense array:

    v_temizdatatodense = v_temizdata.todense()
    pca_t = PCA().fit_transform(v_temizdatatodense)

Finally, I try to classify:

    classifer.fit(pca_t, y_train)

The error for this final classification:

    raise ValueError("Input X must be non-negative")
    ValueError: Input X must be non-negative

In other words, I want to compare two setups: on the one hand, my data (temizdata) goes straight into Naive Bayes; on the other hand, temizdata is first passed through PCA (to reduce the number of inputs) and the result is then classified.

python scikit-learn pca naivebayes
3 answers

Instead of converting the sparse matrix to dense (which is discouraged), I would use scikit-learn's TruncatedSVD, which is a PCA-like dimensionality reduction algorithm (using randomized SVD by default) that works with sparse data:

    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=5, random_state=42)
    data = svd.fit_transform(data)

And, quoting the TruncatedSVD documentation:

In particular, truncated SVD works on term count / tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

which is exactly your use case.
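
For context, a minimal end-to-end sketch of this approach could look like the following. The variable names temizdata and y_train come from your question; n_components=5 follows the snippet above, and the choice of LogisticRegression as the downstream classifier is only an assumption on my part (TruncatedSVD output can contain negative values, so MultinomialNB would reject it):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Vectorize the raw documents into a sparse term-count matrix
    vectorizer = CountVectorizer()
    x_counts = vectorizer.fit_transform(temizdata)

    # Reduce dimensionality directly on the sparse matrix (no .todense() needed)
    svd = TruncatedSVD(n_components=5, random_state=42)
    x_reduced = svd.fit_transform(x_counts)

    # The reduced features may be negative, so use a classifier that accepts
    # real-valued input instead of MultinomialNB
    clf = LogisticRegression()
    clf.fit(x_reduced, y_train)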


The Naive Bayes classifier needs discrete-valued features, but PCA breaks this property of the features. You will need to use a different classifier if you want to use PCA.

There may be other dimensionality reduction methods that work with NB, but I don't know of any. Perhaps simple feature selection might work.

Note: you could try to discretize the features after applying PCA, but I don't think this is a good idea.
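
As an illustration of switching classifiers, a rough sketch might look like this. It reuses v_temizdata and y_train from the question; GaussianNB is one possible choice because it models continuous (and possibly negative) features, and n_components=10 is an arbitrary assumption:

    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB

    # PCA needs a dense array, so convert the sparse count matrix first
    x_dense = v_temizdata.toarray()

    # PCA output is continuous and can be negative
    pca = PCA(n_components=10)
    x_pca = pca.fit_transform(x_dense)

    # GaussianNB handles continuous features, so it accepts the PCA output
    clf = GaussianNB()
    clf.fit(x_pca, y_train)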


The problem is that by applying dimensionality reduction you generate negative features. However, MultinomialNB does not accept negative features. Please see the related questions on this.

Try a different classifier like RandomForest, or try using sklearn.preprocessing.MinMaxScaler() to scale your training features to [0, 1].
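
For example, a minimal sketch of the scaling approach, reusing pca_t, y_train and the classifer variable from the question (the scaling step is the only addition), might look like this:

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.naive_bayes import MultinomialNB

    # Rescale each PCA feature to [0, 1] so no values are negative
    scaler = MinMaxScaler()
    pca_t_scaled = scaler.fit_transform(pca_t)

    # MultinomialNB now accepts the input because it is non-negative
    classifer = MultinomialNB(alpha=.01)
    classifer.fit(pca_t_scaled, y_train)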
