WordNet lemmatizer in NLTK: what is the right lemma for "boss"?

I use NLTK 3.0.4 and noticed that the lemmas returned for the words boss and bosses are different.

    from nltk.stem.wordnet import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    print(wnl.lemmatize("boss", "n"))    # returns "bos"
    print(wnl.lemmatize("bosses", "n"))  # returns "boss"

From my point of view, this is strange behavior, especially since boss is a well-known word in WordNet and there is a rule that preserves ss.

Does anyone have an explanation, or is this just a mistake? How can I handle this?

2 answers
  • After checking the code (_morphy()) that generates the possible analyses for a given word, I found that it contains no rule for preserving a final ss.
  • bos is also a base form in WordNet.

Substitution rules:

    MORPHOLOGICAL_SUBSTITUTIONS = {
        NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
               ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
               ('men', 'man'), ('ies', 'y')],
        VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
               ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
        ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
        ADV: []}

Calling wnl.lemmatize("boss", "n"):

Because a valid base form (bos) can be found by applying the substitution rules, it is returned. If bos were not in WordNet, the lemma for boss would be boss, since no shorter form could be found.
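To make that selection step concrete, here is a minimal sketch of the behavior described in this answer. It is not NLTK's actual code: TOY_LEXICON is a hypothetical stand-in for the WordNet noun index, the function names are made up for illustration, and the sketch assumes each rule is applied once and the shortest known form wins.

```python
# Noun rules copied from the MORPHOLOGICAL_SUBSTITUTIONS table quoted above.
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
              ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
              ('men', 'man'), ('ies', 'y')]

# Hypothetical stand-in for the WordNet noun index (assumption, not real data).
TOY_LEXICON = {"boss", "bos", "dog"}


def candidates(word, rules=NOUN_RULES):
    """Return the word itself plus every form produced by one rule application."""
    forms = [word]
    for old, new in rules:
        if word.endswith(old):
            forms.append(word[:-len(old)] + new)
    return forms


def toy_lemmatize(word, lexicon=TOY_LEXICON):
    """Keep the candidates found in the lexicon and return the shortest one."""
    known = [form for form in candidates(word) if form in lexicon]
    return min(known, key=len) if known else word


print(toy_lemmatize("boss"))                     # bos
print(toy_lemmatize("bosses"))                   # boss
print(toy_lemmatize("boss", lexicon={"boss"}))   # boss
```

The last call shows the point made above: once "bos" is not in the lexicon, "boss" is the only surviving candidate and is returned unchanged.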


This is a bug. If a word that ends in "s" is the singular form of a noun, that word should always be returned as one of the answers when lemmatizing it as a noun. This is the case for boss, loss, moss, lens, etc. If "len" is also a singular noun, it should be returned as well.

But there is an additional problem. When removing the "s" yields an existing stem, "bos" in this case, that stem must NOT itself end in "s". By the "e"-insertion rule, which applies to words ending in "s", "z", "x", "ch" and "sh", the plural of "bos" would be "boses". That is certainly a more plausible plural for this unusual word than "boss". The constraint that should be implemented is that the stem, unless it is marked as irregular, must produce the input form when the plural spelling rules are applied. Since "bos" does not yield "boss" under those rules, it should not be analyzed as the singular of "boss".
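The constraint proposed in this answer could be sketched roughly as follows. This is only an illustration, not part of NLTK: regular_plural and is_plausible_singular are hypothetical names, and only the two spelling rules mentioned here (plain "s" and "e"-insertion after sibilant endings) are covered, so patterns such as "y" to "ies" and irregular nouns are deliberately ignored.

```python
# Endings that trigger "e"-insertion before the plural "s".
SIBILANT_ENDINGS = ("s", "z", "x", "ch", "sh")


def regular_plural(stem):
    """Build the regular plural of a stem using only the two rules above."""
    if stem.endswith(SIBILANT_ENDINGS):
        return stem + "es"
    return stem + "s"


def is_plausible_singular(stem, surface):
    """Accept a stem only if pluralizing it regenerates the input form."""
    return regular_plural(stem) == surface


print(regular_plural("bos"))                     # boses
print(is_plausible_singular("bos", "boss"))      # False
print(is_plausible_singular("boss", "bosses"))   # True
```

Under this check, "bos" would be rejected as the singular of "boss", while "boss" would still be accepted as the singular of "bosses".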
