Testing the NLTK Classifier in a Specific File

The following code trains the Naive Bayes movie review classifier and prints a list of the most informative features.

Note: the **movie_reviews** corpus ships with NLTK.
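
If the corpus and stopword list are not installed yet, you can fetch them with NLTK's downloader first. A minimal sketch (the punkt model is only needed for the word_tokenize example near the end of this page):

    import nltk
    nltk.download('movie_reviews')  # the labeled movie review corpus
    nltk.download('stopwords')      # English stopword list
    nltk.download('punkt')          # tokenizer model used by word_tokenize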

    import string
    import nltk
    from itertools import chain
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    stop = stopwords.words('english')

    # (tokens, label) pairs; the label is the 'pos'/'neg' directory name.
    documents = [([w for w in movie_reviews.words(i)
                   if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0])
                 for i in movie_reviews.fileids()]

    word_features = FreqDist(chain(*[i for i, j in documents]))
    # Note: slicing FreqDist.keys() only worked in Python 2 with old NLTK;
    # in Python 3 use most_common() to get the top 100 words.
    word_features = [w for w, _ in word_features.most_common(100)]

    numtrain = int(len(documents) * 90 / 100)
    train_set = [({i: (i in tokens) for i in word_features}, tag)
                 for tokens, tag in documents[:numtrain]]
    test_set = [({i: (i in tokens) for i in word_features}, tag)
                for tokens, tag in documents[numtrain:]]

    classifier = NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)

(Code from alvas.)

How can I test the classifier on a specific file?

Please let me know if my question is ambiguous or incorrect.

nlp classification nltk text-classification
2 answers

First, read these answers carefully; they contain parts of the answer you need and also briefly explain what the classifier does and how it works in NLTK:


Test classifier for annotated data

Now to answer your question. We assume that your question is a follow-up to this question: Using my own corpus instead of movie_reviews corpus for Classification in NLTK

If your test data is structured in the same way as the movie_reviews corpus, you can simply read it the same way you read the training data:

In case the code is unclear, here's a walkthrough:

    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    traindir = '/home/alvas/my_movie_reviews'
    mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt',
                                          cat_pattern=r'(neg|pos)/.*',
                                          encoding='ascii')

The two lines above read the my_movie_reviews directory, which has this structure:

    \my_movie_reviews
        \pos
            123.txt
            234.txt
        \neg
            456.txt
            789.txt
        README

Then the next line retrieves the documents together with their pos/neg tags, which come from the directory structure.

    documents = [([w for w in mr.words(i)
                   if w.lower() not in stop and w not in string.punctuation],
                  i.split('/')[0])
                 for i in mr.fileids()]

Here's an explanation of the line above:

    # This extracts the pos/neg tags.
    labels = [i.split('/')[0] for i in mr.fileids()]
    # Reads the words from the corpus through the CategorizedPlaintextCorpusReader object.
    words = [w for w in mr.words(i)]
    # Removes the stopwords.
    words = [w for w in mr.words(i) if w.lower() not in stop]
    # Removes the punctuation.
    words = [w for w in mr.words(i) if w not in string.punctuation]
    # Removes the stopwords and punctuation.
    words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
    # Removes the stopwords and punctuation, and pairs the words with the pos/neg labels.
    documents = [([w for w in mr.words(i)
                   if w.lower() not in stop and w not in string.punctuation],
                  i.split('/')[0])
                 for i in mr.fileids()]
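
To make the resulting structure concrete, here is a minimal sketch (assuming the mr reader and documents list from above) that inspects the first entry:

    # Each entry in `documents` is a (token_list, label) tuple.
    first_tokens, first_label = documents[0]
    print(first_label)        # 'neg' or 'pos', taken from the subdirectory name
    print(first_tokens[:10])  # the first ten filtered tokens of that review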

The SAME PROCESS should be applied when you read the test data.

Now for the feature extraction:

The following lines extract the top 100 words as features for the classifier:

    # Extract the word features and put them into a FreqDist
    # object, which records the number of times each unique word occurs.
    word_features = FreqDist(chain(*[i for i, j in documents]))
    # Cut the FreqDist down to the top 100 words in terms of their counts.
    # Note: slicing .keys() only worked in Python 2 with old NLTK;
    # in Python 3 use most_common() instead.
    word_features = [w for w, _ in word_features.most_common(100)]

Next, the documents are processed into a format suitable for the classifier:

    # Split the data into a training part and a testing part.
    numtrain = int(len(documents) * 90 / 100)
    # Process the documents for the training data.
    train_set = [({i: (i in tokens) for i in word_features}, tag)
                 for tokens, tag in documents[:numtrain]]
    # Process the documents for the testing data.
    test_set = [({i: (i in tokens) for i in word_features}, tag)
                for tokens, tag in documents[numtrain:]]

Now, to explain the long list comprehensions behind train_set and test_set:

    # Take the first `numtrain` documents as training documents.
    train_docs = documents[:numtrain]
    # Take the rest of the documents as test documents.
    test_docs = documents[numtrain:]
    # This extracts the feature sets for the classifier; see the full
    # explanation at https://stackoverflow.com/questions/20827741/nltk-naivebayesclassifier-training-for-sentiment-analysis/
    train_set = [({i: (i in tokens) for i in word_features}, tag)
                 for tokens, tag in train_docs]

You need to process the test documents in the same way to extract their features.
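
For a single test document, that featurization looks like this. A minimal sketch, assuming word_features and the test_docs list from the snippet above:

    tokens, tag = test_docs[0]
    # Boolean bag-of-words: True if the feature word occurs in this document.
    featurized = {w: (w in tokens) for w in word_features}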

So here's how you can read the test data:

    stop = stopwords.words('english')

    # Read the training data.
    traindir = '/home/alvas/my_movie_reviews'
    mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt',
                                          cat_pattern=r'(neg|pos)/.*',
                                          encoding='ascii')
    # Convert the training data into tuples of [(words, label), ...].
    documents = [([w for w in mr.words(i)
                   if w.lower() not in stop and w not in string.punctuation],
                  i.split('/')[0])
                 for i in mr.fileids()]

    # Now do the same for the testing data.
    testdir = '/home/alvas/test_reviews'
    mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt',
                                               cat_pattern=r'(neg|pos)/.*',
                                               encoding='ascii')
    # Convert the testing data into tuples of [(words, label), ...].
    test_documents = [([w for w in mr_test.words(i)
                        if w.lower() not in stop and w not in string.punctuation],
                       i.split('/')[0])
                      for i in mr_test.fileids()]

Then continue with the processing steps above to get the label for each test document, as @yvespeirsman answered:

    #### FOR TRAINING DATA ####
    stop = stopwords.words('english')

    # Read the training data.
    traindir = '/home/alvas/my_movie_reviews'
    mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt',
                                          cat_pattern=r'(neg|pos)/.*',
                                          encoding='ascii')
    # Convert the training data into tuples of [(words, label), ...].
    documents = [([w for w in mr.words(i)
                   if w.lower() not in stop and w not in string.punctuation],
                  i.split('/')[0])
                 for i in mr.fileids()]

    # Extract the training features (top 100 words by count).
    word_features = FreqDist(chain(*[i for i, j in documents]))
    word_features = [w for w, _ in word_features.most_common(100)]

    # Use the full data set for training,
    # since your test set is separate.
    train_set = [({i: (i in tokens) for i in word_features}, tag)
                 for tokens, tag in documents]

    #### TRAIN THE CLASSIFIER ####
    classifier = NaiveBayesClassifier.train(train_set)

    #### FOR TESTING DATA ####
    # Now do the same reading and processing for the testing data.
    testdir = '/home/alvas/test_reviews'
    mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt',
                                               cat_pattern=r'(neg|pos)/.*',
                                               encoding='ascii')
    # Convert the testing data into tuples of [(words, label), ...].
    test_documents = [([w for w in mr_test.words(i)
                        if w.lower() not in stop and w not in string.punctuation],
                       i.split('/')[0])
                      for i in mr_test.fileids()]

    # Read the test data into feature sets.
    test_set = [({i: (i in tokens) for i in word_features}, tag)
                for tokens, tag in test_documents]

    #### EVALUATE THE CLASSIFIER ####
    for doc, gold_label in test_set:
        tagged_label = classifier.classify(doc)
        if tagged_label == gold_label:
            print("Woohoo, correct")
        else:
            print("Boohoo, wrong")
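
Instead of (or in addition to) the Woohoo/Boohoo loop, you can get an overall score with NLTK's built-in accuracy helper, which the question's own code already uses. A minimal sketch, assuming the classifier and test_set built above:

    import nltk
    # Fraction of test documents whose predicted label matches the gold label.
    print(nltk.classify.accuracy(classifier, test_set))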

If the above code and explanation do not make sense to you, then you MUST read this tutorial before continuing: http://www.nltk.org/howto/classify.html


Now let's say you have no annotations in your test data, i.e. your test.txt is not in a directory structure like movie_reviews but is just a plain text file:

    \test_movie_reviews
        \1.txt
        \2.txt

Then there's no point in reading it into a categorized corpus; you can simply read and tag each document:

    for infile in os.listdir('test_movie_reviews'):
        for line in open(os.path.join('test_movie_reviews', infile), 'r'):
            # Tokenize and featurize the raw line, then classify it.
            doc = word_tokenize(line.lower())
            featurized_doc = {i: (i in doc) for i in word_features}
            tagged_label = classifier.classify(featurized_doc)

BUT you CANNOT evaluate the results without annotations, so there is no gold label to check the predicted tag against with an if-else. Note that the snippet above also has to tokenize the raw text itself, since CategorizedPlaintextCorpusReader is not used.

If you just want to tag the plaintext file test.txt:

    import string
    from itertools import chain
    from nltk.corpus import stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews
    from nltk import word_tokenize

    stop = stopwords.words('english')

    # Extract the documents.
    documents = [([w for w in movie_reviews.words(i)
                   if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0])
                 for i in movie_reviews.fileids()]

    # Extract the features (top 100 words by count).
    word_features = FreqDist(chain(*[i for i, j in documents]))
    word_features = [w for w, _ in word_features.most_common(100)]

    # Convert the documents into feature sets.
    train_set = [({i: (i in tokens) for i in word_features}, tag)
                 for tokens, tag in documents]

    # Train the classifier.
    classifier = NaiveBayesClassifier.train(train_set)

    # Tag the test file.
    with open('test.txt', 'r') as fin:
        for test_sentence in fin:
            # Tokenize the line.
            doc = word_tokenize(test_sentence.lower())
            featurized_doc = {i: (i in doc) for i in word_features}
            tagged_label = classifier.classify(featurized_doc)
            print(tagged_label)

Once again, please do not just copy and paste the solution; try to understand why and how it works.


You can test a single file with classifier.classify(). This method takes a dictionary with the features as its keys, and True or False as their values, depending on whether that feature occurs in the document or not. It returns the most likely label for the file, according to the classifier. You can then compare this label with the correct label for the file to see whether the classification is right.

In your training and test sets, the feature dictionaries are always the first element in each tuple, and the labels are the second.

Thus, you can classify the first document in the test set as follows:

    (my_document, my_label) = test_set[0]
    if classifier.classify(my_document) == my_label:
        print("correct!")
    else:
        print("incorrect!")
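
If you also want the classifier's confidence rather than just the label, NaiveBayesClassifier provides prob_classify(), which returns a probability distribution over the labels. A minimal sketch using the same test document:

    dist = classifier.prob_classify(my_document)
    print(dist.max())  # the most likely label, same as classify()
    for label in dist.samples():
        print(label, dist.prob(label))  # the probability of each label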
