NLTK stemmer produces odd results

After running nltk.stem.porter.PorterStemmer().stem_word(word) I get a lot of words with the "ing" stripped or the "y" replaced by "i". For example, "Quality" becomes "Qualiti" and (even stranger) "value" becomes "valu".

As the resulting words are not real English words, I'm not sure how I should use them. My best guess is that I have to pass each stem to another function that will give me all the derived / child words of that stem (for example, "value" would return ['valuing', 'valued', 'values', ...]). Is there such a function?

2 answers

Stemming extracts the stem of a word by applying a series of transformation rules that strip common suffixes and prefixes. The result is therefore not guaranteed to be a real English word. The common use of stemming is to normalize words so that different forms of a word are treated as the same. For instance:

 stem_word('value') == stem_word('valuing') 
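To make the idea of "transformation rules that strip suffixes" concrete, here is a minimal sketch. It is a toy stemmer, not the Porter algorithm: the rule list and the name toy_stem are assumptions made up for illustration, and a real stemmer has many more rules and conditions.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The rule list below is a made-up assumption for this sketch.
SUFFIX_RULES = [
    ("ing", ""),   # valuing -> valu
    ("ed", ""),    # valued  -> valu
    ("es", ""),    # values  -> valu
    ("s", ""),     # plain plurals
    ("e", ""),     # value   -> valu
    ("y", "i"),    # quality -> qualiti
]


def toy_stem(word):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least 3 letters of stem so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


# Different surface forms collapse to the same (non-English) stem.
print(toy_stem("value") == toy_stem("valuing"))
print(toy_stem("Quality"))
```

The stems are only keys for comparison, which is why it does not matter that "valu" or "qualiti" are not dictionary words.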

Stemmed words can be indexed for search. The same stemming is applied to an incoming query, so that the query words match the words in the index when performing a search.
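A rough sketch of that index-and-query flow, assuming a tiny in-memory index. The names simple_stem, build_index and search are hypothetical helpers for this example, and simple_stem is only a stand-in for a real stemmer.

```python
from collections import defaultdict


def simple_stem(word):
    """Stand-in stemmer (an assumption for this sketch): strips a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def build_index(documents, stem=simple_stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=simple_stem):
    """Return ids of documents that match every stemmed query word."""
    results = [index.get(stem(token), set()) for token in query.split()]
    return set.intersection(*results) if results else set()


docs = {
    1: "valuing customer feedback",
    2: "the value of testing",
    3: "unrelated text",
}
index = build_index(docs)

# The query "values" is stemmed the same way as the indexed text,
# so it matches documents containing "valuing" and "value".
print(sorted(search(index, "values")))
```

If NLTK is installed, passing PorterStemmer().stem as the stem argument to both functions should work the same way, as long as the index and the query use the same stemmer.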


I am not familiar with this particular function, but in general a word's stem is the root of the word and is not necessarily a valid English word.

Are you using the NLTK book? This chapter covers stemming: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html

