I am trying to create a general synonym identifier for the significant words in a sentence (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (NLTK) in Python for it. The problem I am having is that the synonym finder in NLTK requires a part-of-speech argument in order to look up synonyms. My attempted fix was to use the simplified part-of-speech tagger present in NLTK, and then reduce the tag to its first letter in order to pass this argument into the synonym finder; however, this is not working (a more explicit version of that tag mapping is sketched further down).
    import nltk
    from nltk import stem
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet
    from nltk.tag.simplify import simplify_wsj_tag

    def synonyms(Sentence):
        Keywords = []
        Equivalence = WordNetLemmatizer()
        Stemmer = stem.SnowballStemmer('english')
        for word in Sentence:
            word = Equivalence.lemmatize(word)
        words = nltk.word_tokenize(Sentence.lower())
        text = nltk.Text(words)
        tags = nltk.pos_tag(text)
        simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
        for tag in simplified_tags:
            print tag
            grammar_letter = tag[1][0].lower()
            if grammar_letter != 'd':  # skip determiners
                Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
                print Call
                Word_Set = wordnet.synset(Call)
                paths = Word_Set.lemma_names
                for path in paths:
                    Keywords.append(Stemmer.stem(path))
        return Keywords
This is the code I am working from now. As you can see, I first lemmatize the input to reduce the number of matches I will get in the long run (I plan on running this over tens of thousands of sentences), and in theory I would also stem the word afterwards to further that effect and cut down the number of redundant words I generate. However, this method almost invariably returns errors of the following form:
    Traceback (most recent call last):
      File "C:\Python27\test.py", line 45, in <module>
        synonyms('spray reddish attack force')
      File "C:\Python27\test.py", line 39, in synonyms
        Word_Set = wordnet.synset(Call)
      File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
        raise WordNetError(message % (lemma, pos))
    WordNetError: no lemma 'reddish' with part of speech 'n'
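I do realize the exception itself can be sidestepped: wordnet.synsets() (plural) returns an empty list instead of raising WordNetError when the word/POS pair has no match. Something like the sketch below (the helper name is my own, and lemma_names is an attribute on the NLTK version I'm running) keeps the function from crashing, but it only papers over the lookup failure rather than recovering the right sense:

    from nltk.corpus import wordnet

    def safe_lemma_names(word, pos):
        # synsets() returns [] rather than raising WordNetError
        # when the word has no entry under the given POS.
        synsets = wordnet.synsets(word, pos=pos)
        if not synsets:
            # 'reddish' exists as an adjective but not as a noun,
            # so retry without constraining the part of speech.
            synsets = wordnet.synsets(word)
        return [name for s in synsets for name in s.lemma_names]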
I don't have much control over the data that will be run through this, so simply sanitizing my input ahead of time isn't really an option. Any ideas on how to solve this problem?
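For reference, here is the more explicit tag mapping mentioned above, the kind of thing I could use instead of taking the first letter of the simplified tag (which, if I read the simplified tag set right, goes wrong for adverbs: 'ADV' reduces to 'a' rather than WordNet's 'r'). The helper name is my own:

    from nltk.corpus import wordnet

    def penn_to_wordnet(tag):
        # Map a raw Penn Treebank tag onto the single-letter POS
        # constants that wordnet.synset()/synsets() expect.
        if tag.startswith('J'):
            return wordnet.ADJ   # 'a'
        elif tag.startswith('V'):
            return wordnet.VERB  # 'v'
        elif tag.startswith('N'):
            return wordnet.NOUN  # 'n'
        elif tag.startswith('R'):
            return wordnet.ADV   # 'r'
        return None  # determiners, prepositions, etc. have no WordNet POS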
I have done some more research and I have a promising lead, but I'm still not sure how to implement it. In the case of a word that is not found, or is assigned incorrectly, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link the word to the nearest correctly categorized keyword, perhaps in combination with an edit-distance measure; but once again I couldn't find any documentation on this.
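To make that idea concrete, below is roughly the shape I have in mind, using NLTK's Wu-Palmer similarity with edit distance as a tie-breaker. The function name, the anchor argument, and the tie-breaking scheme are all my own untested sketch (note that Synset.name is an attribute, not a method, on the NLTK 2.x I'm running):

    from nltk.corpus import wordnet
    from nltk.metrics.distance import edit_distance

    def nearest_synset(word, anchor_synset):
        # Rank every candidate synset for `word` by Wu-Palmer similarity
        # to an already correctly classified keyword, breaking ties with
        # the edit distance between the raw word and the head lemma.
        def rank(candidate):
            sim = anchor_synset.wup_similarity(candidate)
            if sim is None:  # incomparable parts of speech
                sim = 0.0
            head = candidate.name.split('.')[0]
            return (sim, -edit_distance(word, head))
        candidates = wordnet.synsets(word)
        return max(candidates, key=rank) if candidates else None

The intent being that a misclassified word like 'reddish' would get pulled toward whichever of its senses is closest to its correctly tagged neighbours.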
python machine-learning nlp nltk wordnet
Slater Victoroff