Normalization of the list of restaurant dishes

Question

Normalization of the list of restaurant dishes

I have a large set of sets of dishes in a restaurant (for example, "Pulled Pork", "Beef Brisket" ...)

I am trying to "normalize" (the wrong word) dishes. I want Pulled Pork and Pulled Pork Sandwich and Jumbo Pork Slider to come together for the same Pulled Pork.

So far, I started working with NLTK using Python, and I had fun playing with frequency distributions, etc.

Does anyone have a high-level strategy to solve this problem? Perhaps some keywords I could google?

thanks

+5

python machine-learning nlp nltk

George B Aug 26 '15 at 19:37

source share

2 answers

It looks like you are effectively trying to resolve code notation on named objects, where the entities are different dishes. You can check out projects like cort and nltk-drt .

However, from your example, it’s a little unclear why a seasoned pork sandwich should be considered the same dish as pulled pork, so you might need a way to create your own set of workouts (like one knocked out from Google) that marks objects based on your desired tolerance .

+1

lemonhead Aug 26 '15 at 19:50

source share

Sait · Accepted Answer · 2015-08-26T19:51:40+0000

You might want to find TFIDF and cosine similarity .

However, there are complex cases. Let's say you have the following three dishes:

Pork
Elongated egg
Egg sandwich

Which of the two are you going to combine?

Pulled pork and pull out the egg.
Bottled Egg and Egg Sandwich

Using TFIDF , you can find the most representative words. For example, the word "sandwich" may appear in many dishes, therefore, is not very representative. (Tuna sandwich, egg sandwich, cheese sandwich, etc.). Mixing a tuna sandwich and a cheese sandwich may not be a good idea.

Once you have TFIDF vectors, you can use the cosine similarity (using TFIDF vectors) and perhaps a static threshold, you can decide whether to combine them or not.

Another problem arises: when you agree, what will you name them? (Egg or egg sandwich?)

Update:

@alvas suggests using clustering after affinity / distinguishability values. I think that would be a good idea. First you can create your nxn distance / similarity nxn using cosine similarity with TFIDF vectors. And after you have a distance matrix, you can group them using the clustering algorithm.

Normalization of the list of restaurant dishes

Update:

More articles: