First, read these answers carefully; they contain parts of the answer you need, and they also briefly explain what a classifier does and how it works in NLTK:
Testing a classifier on annotated data
Now to answer your question. We assume that your question is a follow-up to this question: Using my own corpus instead of movie_reviews corpus for classification in NLTK
If your test text is structured the same way as the movie_reviews corpus, then you can simply read the test data the same way you read the training data:
In case the explanation of the code is unclear, here's a walkthrough:
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
The two lines above read the my_movie_reviews directory, which has this structure:
\my_movie_reviews
    \pos
        123.txt
        234.txt
    \neg
        456.txt
        789.txt
    README
Then the next line reads the documents together with their pos/neg labels, which come from the directory structure.
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
Here's an explanation of the line above:
# This extracts the pos/neg labels
labels = [i.split('/')[0] for i in mr.fileids()]
# Reads the words from the corpus through the CategorizedPlaintextCorpusReader object
words = [w for w in mr.words(i)]
# Removes the stopwords
words = [w for w in mr.words(i) if w.lower() not in stop]
# Removes the punctuation
words = [w for w in mr.words(i) if w not in string.punctuation]
# Removes the stopwords and punctuation
words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
# Removes the stopwords and punctuation, and puts the words in a tuple with the pos/neg label
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
The SAME process should be applied when you read the test data.
Now on to processing the features:
The following lines take the top 100 features for the classifier:
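A minimal sketch of that top-100 selection, using `collections.Counter` (NLTK's `FreqDist` is a `Counter` subclass, so the same `most_common()` call works on it) and toy stand-in data:

```python
from itertools import chain
from collections import Counter

# Toy stand-in for the `documents` list built above (hypothetical data);
# in the real pipeline these tuples come from CategorizedPlaintextCorpusReader.
documents = [(['great', 'movie', 'great', 'plot'], 'pos'),
             (['bad', 'movie', 'bad'], 'neg')]

# Count every word across all documents and keep the 100 most frequent ones.
freqs = Counter(chain(*[tokens for tokens, tag in documents]))
word_features = [w for w, count in freqs.most_common(100)]
```

Using `most_common(100)` rather than slicing `keys()` guarantees that you actually get the most frequent words, since a plain `keys()` slice depends on insertion order rather than frequency.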
Next, process the documents into a format suitable for classification:
# Splits the documents into a 90% training set and a 10% testing set
numtrain = int(len(documents) * 90 / 100)
# Process the documents for the training data
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
# Process the documents for the testing data
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
Now, to explain the long list comprehensions for train_set and test_set:
You need to process the documents as above to extract the same features from the test documents.
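As a toy illustration of what the `{i: (i in tokens) for i in word_features}` comprehension produces for a single document (hypothetical words):

```python
# Hypothetical vocabulary (word_features) and one tokenized document
word_features = ['great', 'bad', 'movie']
tokens = ['great', 'movie']

# Each document becomes a dict mapping every feature word to True/False
# depending on whether it occurs in that document
featureset = {i: (i in tokens) for i in word_features}
```

Here `featureset` is `{'great': True, 'bad': False, 'movie': True}`: a boolean "bag of words" over the fixed vocabulary, which is exactly the input format NLTK's classifiers expect.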
So here's how you can read the test data:
stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Now do the same for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
Then continue with the processing steps above, and simply do this to get the label for the test document, as @yvespeirsman answered:
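A minimal, self-contained sketch of that final step, with a hypothetical two-document training set in place of the full movie-review featuresets built above:

```python
from nltk.classify import NaiveBayesClassifier

# Hypothetical tiny training set in the same {feature: bool} format
train_set = [({'great': True, 'bad': False}, 'pos'),
             ({'great': False, 'bad': True}, 'neg')]
classifier = NaiveBayesClassifier.train(train_set)

# Classify one unlabeled featureset, as you would for a test document
label = classifier.classify({'great': True, 'bad': False})
```

With annotated test data you would instead call `nltk.classify.accuracy(classifier, test_set)` to evaluate the whole test set at once.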
If the above code and explanation do not make any sense to you, then you MUST read this guide before continuing: http://www.nltk.org/howto/classify.html
Now let's say that you have no annotations in your test data, i.e. your test.txt is not in a directory structure like movie_reviews but is just a plain text file:
\test_movie_reviews
    \1.txt
    \2.txt
Then there is no point in reading it with a categorized corpus reader; you can simply read and tag the documents, i.e.:
for infile in os.listdir('test_movie_reviews'):
    for line in open(os.path.join('test_movie_reviews', infile), 'r'):
        tokens = [w for w in word_tokenize(line) if w.lower() not in stop and w not in string.punctuation]
        featureset = {i: (i in tokens) for i in word_features}
        tagged_label = classifier.classify(featureset)
BUT you CANNOT evaluate the results without annotations, so you cannot check the predicted tag with if-else, and you also need to tokenize the text yourself, since you are not using CategorizedPlaintextCorpusReader.
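A minimal sketch of doing that tokenization yourself (using plain `str.split()` here so the example is self-contained; NLTK's `word_tokenize` is the usual choice but requires the punkt model to be downloaded):

```python
import string

line = "This movie was great !"
# Lowercase, split on whitespace, and drop punctuation tokens, mirroring
# the cleanup applied to the training corpus above
tokens = [w.lower() for w in line.split() if w not in string.punctuation]
```

You would then build the same `{feature: bool}` dict from `tokens` before passing it to the classifier, so the test features line up with the training features.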
If you just want to tag a plaintext file test.txt:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import word_tokenize

stop = stopwords.words('english')
Once again, please do not just copy and paste the solution; try to understand why and how it works.