NLTK Wordnet Synset for Dictionary Phrase

I am working with the Python NLTK WordNet API. I am trying to find the synset that best represents a group of words.

If I need to find the best synset for something like "school and office supplies", I'm not sure how to do this. So far, I have tried finding the synsets for the individual words and then working out the best lowest common hypernym, for example:

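    import itertools

    from nltk import pos_tag, word_tokenize
    from nltk.corpus import wordnet


    def find_best_synset(category_name):
        text = word_tokenize(category_name)
        tags = pos_tag(text)

        # Collect the candidate synsets of every word that has a WordNet POS.
        # get_wordnet_pos is a helper (not shown) that maps a tag to a WordNet POS or None.
        node_synsets = []
        for word, tag in tags:
            pos = get_wordnet_pos(tag)
            if not pos:
                continue
            node_synsets.append(wordnet.synsets(word, pos=pos))

        max_score = 0
        max_synset = None
        max_combination = None
        # Try every combination that takes one synset per word ...
        for combination in itertools.product(*node_synsets):
            # ... and score every pair of synsets inside that combination.
            for test in itertools.combinations(combination, 2):
                score = wordnet.path_similarity(test[0], test[1])
                # path_similarity can return None when no path exists.
                if score is not None and score > max_score:
                    max_score = score
                    max_combination = test
                    max_synset = test[0].lowest_common_hypernyms(test[1])
        return max_synset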
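The get_wordnet_pos helper is not shown above. A minimal sketch of what such a helper might look like, assuming pos_tag returns Penn Treebank tags and that the single-letter strings below correspond to NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV constants (the body is an illustration, not the original code):

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if none applies."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("N"):
        return "n"  # noun
    if treebank_tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS; not the particle tag RP)
    # Conjunctions, determiners, etc. have no WordNet POS and get skipped.
    return None
```

Returning None for tags such as CC ("and") is what lets the `if not pos: continue` check drop words that WordNet does not cover.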

However, this does not work very well, and it is also computationally expensive. Is there a way to figure out which synset best represents multiple words together?

Thank you for your help!

1 answer

Apart from what I already said in the comments, I think the way you select the best hypernym may be flawed. The synset you end up with is not the lowest common hypernym of all the words, but only of two of them.

Let's stick to your example of "school and office supplies". For each word in the expression, you get several synsets. So the node_synsets variable will look something like this:

 [[school_1, school_2], [office_1, office_2, office_3], [supply_1]] 

In this example, there are 6 ways to combine one synset of each word with one synset of each of the other words:

    [(school_1, office_1, supply_1),
     (school_1, office_2, supply_1),
     (school_1, office_3, supply_1),
     (school_2, office_1, supply_1),
     (school_2, office_2, supply_1),
     (school_2, office_3, supply_1)]

These triples are what you iterate over in the outer for loop (with itertools.product). If the expression had 4 words, you would be iterating over quadruples, with 5 words over quintuples, and so on.

Now, in the inner for loop, you pair off the elements of each triple. For the first triple, the pairs are:
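To make the outer loop concrete, here is a small sketch that uses placeholder strings in place of real synsets (the names are the illustrative ones from above, not actual WordNet entries):

```python
from itertools import product

# Placeholder strings standing in for the synsets of each word.
node_synsets = [
    ["school_1", "school_2"],
    ["office_1", "office_2", "office_3"],
    ["supply_1"],
]

# One synset per word, in every possible arrangement.
triples = list(product(*node_synsets))

print(len(triples))  # 2 * 3 * 1 = 6
for triple in triples:
    print(triple)
```

The number of combinations is the product of the per-word synset counts, which is why the approach gets expensive quickly for longer phrases or highly ambiguous words.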

 [(school_1, office_1), (school_1, supply_1), (office_1, supply_1)] 

... and you determine the lowest common hypernym of each pair. So in the end you get the lowest common hypernym of, say, school_2 and office_1, which might be some kind of institution. This is probably not very meaningful, since it does not take any synset of the last word into account.

Perhaps you should try to find the lowest common hypernym of all three words, for each combination of their synsets, and then pick the best one among those.
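The pairing step can be reproduced with the same placeholder strings. This sketch only shows which pairs itertools.combinations yields and which word each pair leaves out:

```python
from itertools import combinations

triple = ("school_1", "office_1", "supply_1")

# Every unordered pair drawn from the triple.
pairs = list(combinations(triple, 2))
print(pairs)

# For each pair, list the synsets of the triple that the pair ignores.
for pair in pairs:
    ignored = [s for s in triple if s not in pair]
    print(pair, "ignores", ignored)
```

Whichever pair ends up with the highest score, one of the three words plays no part in the hypernym that is returned.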
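A rough sketch of that idea follows. It relies on only two methods that NLTK's Synset objects provide, common_hypernyms(other) and max_depth(). ToySynset and the tiny taxonomy are hypothetical stand-ins so that the example runs without the WordNet corpus, and using depth as the measure of "best" is my assumption, not something settled above:

```python
from itertools import product


class ToySynset:
    """Hypothetical stand-in for an NLTK Synset in a single-parent taxonomy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def _ancestors(self):
        # The node itself plus every node on the way up to the root.
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def common_hypernyms(self, other):
        theirs = other._ancestors()
        return [node for node in self._ancestors() if node in theirs]

    def max_depth(self):
        return len(self._ancestors()) - 1

    def __repr__(self):
        return f"ToySynset({self.name!r})"


def lowest_common_hypernyms_of_all(synsets):
    """Return the deepest hypernyms shared by every synset in `synsets`."""
    first, *rest = synsets
    # Start with all hypernyms of the first synset, then intersect with each other one.
    common = set(first.common_hypernyms(first))
    for other in rest:
        common &= set(first.common_hypernyms(other))
    if not common:
        return []
    deepest = max(node.max_depth() for node in common)
    return [node for node in common if node.max_depth() == deepest]


def best_common_hypernym(node_synsets):
    """Try one synset per word in every combination; keep the deepest shared hypernym."""
    best, best_depth = None, -1
    for combination in product(*node_synsets):
        for candidate in lowest_common_hypernyms_of_all(combination):
            if candidate.max_depth() > best_depth:
                best, best_depth = candidate, candidate.max_depth()
    return best


# Made-up taxonomy, for illustration only.
entity = ToySynset("entity")
institution = ToySynset("institution", entity)
artifact = ToySynset("artifact", entity)
school_1 = ToySynset("school_1", institution)
school_2 = ToySynset("school_2", artifact)
office_1 = ToySynset("office_1", artifact)
office_2 = ToySynset("office_2", institution)
supply_1 = ToySynset("supply_1", artifact)

node_synsets = [[school_1, school_2], [office_1, office_2], [supply_1]]
best = best_common_hypernym(node_synsets)
print(best)
```

In this toy data, school_1 and office_2 share "institution", but no combination that includes supply_1 does, so the result is "artifact", the deepest hypernym that covers all three words. With NLTK installed and the WordNet corpus downloaded, the two functions should accept the lists returned by wordnet.synsets(word, pos=pos) in place of the toy objects, though I have not measured how well depth works as a ranking on real data.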
