@alvas's answer works, but it can be done faster. Suppose you have `documents`: a list of strings.
```python
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
```
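Continuing from the snippet above, here is a minimal sketch of the filtering loop this answer has in mind (the `texts` name is only illustrative):

```python
# iterate over the documents, keeping lowercased tokens that are not stop words
texts = []
for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
    texts.append(list_of_words)
```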
Note that because the lookup here is done in a set (not in a list), each membership check is in theory about len(stop_words)/2 times faster, which matters if you need to work through many documents.
For 5000 documents of roughly 300 words each, that is the difference between 1.8 seconds for my version and 20 seconds for @alvas's.
P.S. In most cases you split the text into words in order to run some further classification task for which tf-idf is used. So it is most likely better to apply a stemmer as well:
```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
```
and use `[porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]` inside the loop.
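Putting the pieces together, a rough self-contained sketch of the whole pipeline (the `stemmed_texts` name is illustrative, and `documents` is assumed to be the list of strings from above):

```python
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
porter = PorterStemmer()

# `documents` is assumed to be a list of raw strings
stemmed_texts = []
for doc in documents:
    stemmed_texts.append(
        [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]
    )
```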
Salvador Dali, Sep 09 '15 at 1:27