Get rid of stop words and punctuation

I am struggling with NLTK's stop words.

Here is my code. Can someone tell me what is wrong?

    from nltk.corpus import stopwords

    def removeStopwords(palabras):
        return [word for word in palabras if word not in stopwords.words('spanish')]

    palabras = ''' my text is here '''
2 answers

Your problem is that the iterator for a string returns every character, not every word.

For example:

 >>> palabras = "Buenos dias" >>> [c for c in palabras] ['B', 'u', 'e', 'n', 'a', 's', ' ', 'd', 'i', 'a', 's'] 

You need to iterate over and check every word; fortunately, the split method already exists on Python strings. However, since you are dealing with natural language, including punctuation, the re module gives you a more robust way to split out words.
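For example, a minimal sketch of the difference (the sample string here is mine, not from the original question):

    import re

    palabras = "Buenos dias, mundo."

    # str.split only breaks on whitespace, so punctuation stays attached
    print(palabras.split())              # ['Buenos', 'dias,', 'mundo.']

    # \w+ matches runs of word characters, dropping the punctuation
    print(re.findall(r'\w+', palabras))  # ['Buenos', 'dias', 'mundo']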

Once you have a list of words, you must lowercase them all before comparing, and then filter them as you have already shown.

Good luck.

EDIT 1

Try this code, it should work for you. It shows two ways to do it; they are functionally identical, but the first is a little clearer and the second is more Pythonic.

    import re
    from nltk.corpus import stopwords

    sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

    # We only want to work with lowercase for the comparisons
    sentence = sentence.lower()

    # Remove punctuation and split into separate words
    # (the original also passed re.LOCALE, which is not valid with
    # str patterns in Python 3, so it is dropped here)
    words = re.findall(r'\w+', sentence, flags=re.UNICODE)

    # This is the simple way to remove stop words
    important_words = []
    for word in words:
        if word not in stopwords.words('spanish'):
            important_words.append(word)
    print(important_words)

    # This is the more Pythonic way
    important_words = list(filter(lambda x: x not in stopwords.words('spanish'), words))
    print(important_words)
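One practical note, not part of the original answer: stopwords.words('spanish') rebuilds the list on every call, so for anything beyond short sentences it is worth fetching it once and using a set for fast membership tests:

    # Compute the stop list once; a set makes each lookup O(1)
    spanish_stopwords = set(stopwords.words('spanish'))
    important_words = [word for word in words if word not in spanish_stopwords]
    print(important_words)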

Hope this helps you.


If you use a tokenizer first, you can simply compare the resulting list of tokens (words) against the stop list, so you do not need the re module. I added an extra argument to switch between languages.

    import nltk
    from nltk.corpus import stopwords

    def remove_stopwords(sentence, language):
        return [token for token in nltk.word_tokenize(sentence)
                if token.lower() not in stopwords.words(language)]
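A quick usage sketch (the example sentence is mine; it assumes the NLTK 'punkt' and 'stopwords' data have been downloaded, e.g. via nltk.download):

    # nltk.download('punkt'); nltk.download('stopwords')  # one-time setup, if needed
    print(remove_stopwords('El amor todo lo puede', 'spanish'))
    # Something like ['amor', 'puede']; exact output depends on NLTK's stop list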

Let me know if it was useful to you ;)

