I am trying to create a general synonym identifier for the significant words in a sentence (i.e. not "a" or "the"), and I am using the Natural Language Toolkit (NLTK) in Python for it. The problem I am having is that the synonym finder in NLTK requires a part-of-speech argument in order to look up synonyms. My attempted fix was to use the simplified part-of-speech tagger present in NLTK, and then reduce the tag to its first letter in order to pass this argument into the synonym finder; however, this is not working (a more explicit version of that tag mapping is sketched further down).
    import nltk
    from nltk import stem
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet
    from nltk.tag.simplify import simplify_wsj_tag

    def synonyms(Sentence):
        Keywords = []
        Equivalence = WordNetLemmatizer()
        Stemmer = stem.SnowballStemmer('english')
        for word in Sentence:
            word = Equivalence.lemmatize(word)
        words = nltk.word_tokenize(Sentence.lower())
        text = nltk.Text(words)
        tags = nltk.pos_tag(text)
        simplified_tags = [(word, simplify_wsj_tag(tag)) for word, tag in tags]
        for tag in simplified_tags:
            print tag
            grammar_letter = tag[1][0].lower()
            if grammar_letter != 'd':  # skip determiners
                Call = tag[0].strip() + "." + grammar_letter.strip() + ".01"
                print Call
                Word_Set = wordnet.synset(Call)
                paths = Word_Set.lemma_names
                for path in paths:
                    Keywords.append(Stemmer.stem(path))
        return Keywords
This is the code I am working from now. As you can see, I first lemmatize the input to reduce the number of matches I will get in the long run (I plan on running this over tens of thousands of sentences), and in theory I would also stem the word afterwards to further that effect and cut down the number of redundant words I generate. However, this method almost invariably returns errors of the following form:
    Traceback (most recent call last):
      File "C:\Python27\test.py", line 45, in <module>
        synonyms('spray reddish attack force')
      File "C:\Python27\test.py", line 39, in synonyms
        Word_Set = wordnet.synset(Call)
      File "C:\Python27\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1016, in synset
        raise WordNetError(message % (lemma, pos))
    WordNetError: no lemma 'reddish' with part of speech 'n'
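I do realize the exception itself can be sidestepped: wordnet.synsets() (plural) returns an empty list instead of raising WordNetError when the word/POS pair has no match. Something like the sketch below (the helper name is my own, and lemma_names is an attribute on the NLTK version I'm running) keeps the function from crashing, but it only papers over the lookup failure rather than recovering the right sense:

    from nltk.corpus import wordnet

    def safe_lemma_names(word, pos):
        # synsets() returns [] rather than raising WordNetError
        # when the word has no entry under the given POS.
        synsets = wordnet.synsets(word, pos=pos)
        if not synsets:
            # 'reddish' exists as an adjective but not as a noun,
            # so retry without constraining the part of speech.
            synsets = wordnet.synsets(word)
        return [name for s in synsets for name in s.lemma_names]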
I don't have much control over the data that will be run through this, so simply sanitizing my input ahead of time isn't really an option. Any ideas on how to solve this problem?
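For reference, here is the more explicit tag mapping mentioned above, the kind of thing I could use instead of taking the first letter of the simplified tag (which, if I read the simplified tag set right, goes wrong for adverbs: 'ADV' reduces to 'a' rather than WordNet's 'r'). The helper name is my own:

    from nltk.corpus import wordnet

    def penn_to_wordnet(tag):
        # Map a raw Penn Treebank tag onto the single-letter POS
        # constants that wordnet.synset()/synsets() expect.
        if tag.startswith('J'):
            return wordnet.ADJ   # 'a'
        elif tag.startswith('V'):
            return wordnet.VERB  # 'v'
        elif tag.startswith('N'):
            return wordnet.NOUN  # 'n'
        elif tag.startswith('R'):
            return wordnet.ADV   # 'r'
        return None  # determiners, prepositions, etc. have no WordNet POS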
I have done some more research and I have a promising lead, but I'm still not sure how to implement it. In the case of a word that is not found, or is assigned incorrectly, I would like to use a similarity metric (Leacock-Chodorow, Wu-Palmer, etc.) to link the word to the nearest correctly categorized keyword, perhaps in combination with an edit-distance measure; but once again I couldn't find any documentation on this.
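To make that idea concrete, below is roughly the shape I have in mind, using NLTK's Wu-Palmer similarity with edit distance as a tie-breaker. The function name, the anchor argument, and the tie-breaking scheme are all my own untested sketch (note that Synset.name is an attribute, not a method, on the NLTK 2.x I'm running):

    from nltk.corpus import wordnet
    from nltk.metrics.distance import edit_distance

    def nearest_synset(word, anchor_synset):
        # Rank every candidate synset for `word` by Wu-Palmer similarity
        # to an already correctly classified keyword, breaking ties with
        # the edit distance between the raw word and the head lemma.
        def rank(candidate):
            sim = anchor_synset.wup_similarity(candidate)
            if sim is None:  # incomparable parts of speech
                sim = 0.0
            head = candidate.name.split('.')[0]
            return (sim, -edit_distance(word, head))
        candidates = wordnet.synsets(word)
        return max(candidates, key=rank) if candidates else None

The intent being that a misclassified word like 'reddish' would get pulled toward whichever of its senses is closest to its correctly tagged neighbours.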
python machine-learning nlp nltk wordnet
Slater Victoroff