Removing stop words with NLTK

I am trying to process user-entered text by removing stop words with the NLTK toolkit, but during stop-word removal words such as 'and', 'or' and 'not' get deleted as well. I want these words to remain after the stop-word removal step, because they are operators needed for further processing of the text as a query. I do not know in advance which words can act as operators in a text query, and I also want to remove the unnecessary words from my text.

+54
python nlp nltk stop-words
Oct 02 '13 at 5:29
4 answers

I suggest you create your own list of operator words and take them out of the stop-word list. Sets can be conveniently subtracted, so:

    operators = set(('and', 'or', 'not'))
    stop = set(stopwords...) - operators

Then you can simply test whether a word is in or not in the set, without relying on whether your operators happen to be part of the stop-word list. You can later switch to a different stop-word list or add an operator.

    if word.lower() not in stop:
        # use word
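
Putting the two pieces together, a minimal sketch (the operator set and the sample query are only illustrative assumptions):

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # words to keep even though NLTK's list counts them as stop words
    operators = set(('and', 'or', 'not'))
    stop = set(stopwords.words('english')) - operators

    query = "cats and dogs but not birds"   # hypothetical user input
    kept = [word for word in word_tokenize(query) if word.lower() not in stop]
    # with the default English list this keeps the operators 'and'/'not'
    # while dropping filler such as 'but'
    print(kept)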
+50
Jun 08 '14 at 13:45

NLTK has a built-in list of stop words made up of 2,400 stop words for 11 languages (Porter et al.), see http://nltk.org/book/ch02.html

    >>> from nltk import word_tokenize
    >>> from nltk.corpus import stopwords
    >>> stop = set(stopwords.words('english'))
    >>> sentence = "this is a foo bar sentence"
    >>> print([i for i in sentence.lower().split() if i not in stop])
    ['foo', 'bar', 'sentence']
    >>> [i for i in word_tokenize(sentence.lower()) if i not in stop]
    ['foo', 'bar', 'sentence']

I also recommend looking into using tf-idf to remove stop words; see Effects of Stemming on the term frequency?
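
If you want to experiment with that idea directly in NLTK, here is a rough sketch using nltk.text.TextCollection; the toy corpus and the "score above zero" cut-off are illustrative assumptions, not recommended settings:

    from nltk import word_tokenize
    from nltk.text import TextCollection

    docs = ["this is a foo bar sentence",
            "is this another foo sentence",
            "bar bar foo this is it"]
    tokenized = [word_tokenize(d.lower()) for d in docs]
    collection = TextCollection(tokenized)

    doc = tokenized[0]
    # words that occur in every document get idf == 0, so a tf-idf of 0 drops them
    kept = [w for w in doc if collection.tf_idf(w, doc) > 0]
    print(kept)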

+123
Oct 02 '13 at 8:41

@alvas's answer does the job, but it can be done faster. Suppose you have documents: a list of strings.

    from nltk.corpus import stopwords
    from nltk.tokenize import wordpunct_tokenize

    stop_words = set(stopwords.words('english'))
    stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])  # remove this line if you need the punctuation

    for doc in documents:
        list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]

Note that because the membership test here is against a set (not a list), the lookup will theoretically be about len(stop_words)/2 times faster, which matters when you need to process many documents.

For 5,000 documents of roughly 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.
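
To see the set-versus-list lookup difference on your own machine, here is a rough sketch with timeit (the globals argument needs Python 3.5+, and the absolute numbers are only illustrative):

    import timeit
    from nltk.corpus import stopwords

    stop_list = stopwords.words('english')   # plain Python list
    stop_set = set(stop_list)                # the same words as a set

    # 'zebra' is not a stop word, so the list version has to scan every entry
    print(timeit.timeit("'zebra' in stop_list", globals=globals(), number=100000))
    print(timeit.timeit("'zebra' in stop_set", globals=globals(), number=100000))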

P.S. In most cases you split the text into words in order to perform some other classification task for which tf-idf is used. Therefore, most likely, it would also be better to use a stemmer:

    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()

and use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside the loop.
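
Put together, a small sketch of the full loop; the documents list here is just a placeholder standing in for your own data:

    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from nltk.tokenize import wordpunct_tokenize

    documents = ["This is a foo bar sentence.", "Another document goes here."]  # placeholder data

    stop_words = set(stopwords.words('english'))
    porter = PorterStemmer()

    processed = [[porter.stem(i.lower())
                  for i in wordpunct_tokenize(doc)
                  if i.lower() not in stop_words]
                 for doc in documents]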

+22
Sep 09 '15 at 1:27

@alvas has a good answer. But again it depends on the nature of the task: if, for example, your application should treat all conjunctions (e.g. and, or, but, if, while) and all determiners (e.g. a, some, most, every, no) as stop words, while considering all other parts of speech legitimate, then you may want to look at this solution, which uses a part-of-speech tagset to discard words. Check table 5.1:

    import nltk

    STOP_TYPES = ['DET', 'CNJ']

    text = "some data here"
    tokens = nltk.pos_tag(nltk.word_tokenize(text))
    good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]
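
Note that 'DET' and 'CNJ' come from the simplified tagset shown in table 5.1 of the NLTK book; the default nltk.pos_tag tagger returns Penn Treebank tags ('DT', 'CC', ...) instead, so you would either map the tags or request the universal tagset. A sketch of the latter, assuming NLTK 3.x with the universal_tagset mapping data installed (nltk.download('universal_tagset')):

    import nltk

    # in the universal tagset determiners are 'DET' and conjunctions are 'CONJ'
    STOP_TYPES = ['DET', 'CONJ']

    text = "some data here"
    tokens = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')
    good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]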
+9
Jun 13 '14 at 21:37


