Making the Levenshtein distance algorithm fit my needs

I'm having problems with the Levenshtein distance algorithm.

I use the Levenshtein distance algorithm to compare a product name against a list of product names and find the closest match. However, I need to tweak it a bit. I am using the example implementation from dotnetperls.com.
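For reference, the distance function itself is just the standard dynamic-programming implementation. A rough sketch of what I am using (not the exact dotnetperls code; the method name is my own):

static int LevenshteinDistance(string s, string t)
{
    // d[i, j] = edit distance between the first i characters of s
    // and the first j characters of t
    var d = new int[s.Length + 1, t.Length + 1];

    for (int i = 0; i <= s.Length; i++) d[i, 0] = i; // i deletions
    for (int j = 0; j <= t.Length; j++) d[0, j] = j; // j insertions

    for (int i = 1; i <= s.Length; i++)
    {
        for (int j = 1; j <= t.Length; j++)
        {
            int cost = s[i - 1] == t[j - 1] ? 0 : 1;      // substitution cost
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,                          // deletion
                d[i, j - 1] + 1),                         // insertion
                d[i - 1, j - 1] + cost);                  // substitution
        }
    }

    return d[s.Length, t.Length];
}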

Let's say I have a list A of 2000 product names from my own database. I sell all these products myself.

Then all of a sudden I get list B from one of my suppliers, with product names and a new price for each product. This can happen more than once a year, so I want to write software to do the work instead of matching everything manually.

The problem is that this supplier is not very consistent: he makes small changes to the names from time to time, which means I cannot do a simple string comparison.

I have implemented the distance algorithm, but it still does not fit my needs!

Looking through my supplier's list, I came across a product called

American anti-dandruff shampoo 250 ml

This product has been successfully matched with my own product called

American crew anti-dandruff 250 ml.

At a distance of 10.

Problem

I also came across a product called

American Crew 3-In-1 Shampoo 450 ml.

This was mistakenly matched to

American Crew Daily Shampoo 450 ml.

instead of my own product

American crew 3 in 1, 450 ml.

And I understand why! But I'm not sure how I can change the algorithm here.

Any ideas?

By the way, I'm not very good at algorithms, but I think that some kind of weighting would help me here.

EDIT:

Computing time is not really a concern. Even if it takes ten hours to complete, it's still a lot better than doing it manually :P

+7
4 answers

One approach is to use several matching methods and apply a weight to each of them. I have assumed you have an Item class with at least a Name string property:

double levWeight = 1.0;   // adjust these weights as you see fit
double matchWeight = 1.0;

// add whatever separators you'd like here
var splitOn = new[] { ' ', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '-' };
Func<string, string[]> split = s => s.Split(splitOn, StringSplitOptions.RemoveEmptyEntries);

var matches =
    from xx in mine
    let xp = split(xx.Name)
    select new
    {
        Item = xx,
        Matches =
            (from yy in theirs
             let yp = split(yy.Name)
             // how many name components match
             let mm = xp.Intersect(yp, StringComparer.OrdinalIgnoreCase).Count()
             // the Levenshtein distance of the two names
             let ld = LevDist(xx.Name, yy.Name)
             // weight our two criteria
             let ww = (matchWeight * mm) + (levWeight / ld)
             // should match on at least ONE name component
             where mm > 0
             orderby ww descending
             select yy)
    };

Running this against your example data, I get the following output:

  • American Crew Dandruff Shampoo 250 ml
    • American crew anti-dandruff 250 ml
    • American crew Daily Shampoo 450 ml
    • American crew 3 in 1, 450 ml
  • American crew 3-in-1 shampoo 450 ml
    • American crew 3 in 1, 450 ml
    • American crew Daily Shampoo 450 ml
    • American crew anti-dandruff 250 ml

If you have more criteria, you simply add them to the query (as additional let clauses) and apply some weight to their result.

Another possible criterion is "name components appear in the same order":

let or = xp.Zip(yp, (a, b) => String.Compare(a, b, true))
           .TakeWhile(c => c == 0)
           .Count()
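Once you have the query, picking the actual matches is just a matter of taking the top candidate for each item. A minimal sketch of how you might consume it (assuming the matches variable from above):

foreach (var m in matches)
{
    // candidates are already ordered by weight, so the first one is the best guess
    var best = m.Matches.FirstOrDefault();
    Console.WriteLine(m.Item.Name + " -> " +
        (best != null ? best.Name : "(no candidate shared a name component)"));
}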
+3

Could you combine Levenshtein with a second pass that ignores spaces and dots, a third pass that counts the words the names have in common, and a fourth pass based on the first word in the product name?
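For example, the passes could be combined into a single score, roughly like this (the weights and the LevenshteinDistance helper are only assumptions for illustration):

// A rough combined score; the weights are arbitrary and would need tuning.
static double CombinedScore(string a, string b)
{
    // second pass: compare with spaces and dots stripped out
    string na = a.Replace(" ", "").Replace(".", "").ToLowerInvariant();
    string nb = b.Replace(" ", "").Replace(".", "").ToLowerInvariant();
    int dist = LevenshteinDistance(na, nb);

    // third pass: count the words the two names have in common
    var wa = a.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    var wb = b.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    int common = wa.Intersect(wb).Count();

    // fourth pass: bonus when the first word (often the brand) matches
    double firstWordBonus = wa.Length > 0 && wb.Length > 0 && wa[0] == wb[0] ? 1.0 : 0.0;

    // more common words and a matching first word are good, a large distance is bad
    return common + firstWordBonus - 0.1 * dist;
}

Higher scores would then mean better matches.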

Maybe you can take a look at neural networks? It will take you a long time to create a training database, but it may also be the answer (or part of the answer).

+1

You need to bring in some of your external knowledge of the data. For example, in the case you showed, the word "shampoo" is not informative about the product (i.e., it does not add useful information about whether two products are the same). If you removed the word "shampoo" from your data, this method would work.

So one thing you can do is compute which words are very common (appear in many product names) and remove them from your data, as they are probably not informative for your problem. Then run the algorithm.

The second small tweak is to use your knowledge of the "-" character: it can be treated as a space, so replace every "-" with " " before running your algorithm.
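A minimal sketch of both tweaks together (the method name and the 50% frequency threshold are just illustrative assumptions):

// Preprocess the names before handing them to the distance algorithm:
// replace '-' with a space and drop words that occur in too many names.
static List<string> Preprocess(List<string> names, double maxWordFrequency = 0.5)
{
    var cleaned = names.Select(n => n.Replace('-', ' ').ToLowerInvariant()).ToList();

    // count how many names each word appears in
    var docFrequency = cleaned
        .SelectMany(n => n.Split(' ').Distinct())
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());

    // drop words like "shampoo" that occur in more than half of the names
    return cleaned
        .Select(n => string.Join(" ", n
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => docFrequency[w] <= maxWordFrequency * names.Count)))
        .ToList();
}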

This way you do not need to change the algorithm itself, but you still use your knowledge of the problem (I understand that you want to avoid changing the algorithm).

Finally, you could consider more drastic changes, for example not using the Levenshtein distance at all, or comparing whole words instead of single characters. You do not show your full data set, but the Levenshtein distance, in the form you are currently using it, works best when the differences are limited to a small number of characters (for example, typos).
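For example, a very simple whole-word comparison could be the fraction of words that two names share, along these lines (just a sketch of the idea, not a drop-in replacement for the distance):

// Fraction of words the two names share: 0.0 = nothing in common, 1.0 = same word set.
static double WordOverlap(string a, string b)
{
    var wa = new HashSet<string>(a.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    var wb = new HashSet<string>(b.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    if (wa.Count == 0 || wb.Count == 0) return 0.0;

    var union = new HashSet<string>(wa);
    union.UnionWith(wb);
    wa.IntersectWith(wb);               // wa now holds only the common words
    return (double)wa.Count / union.Count;
}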

+1

The algorithm does not give you a match, it gives you a distance. So in your problematic scenario you have to compare the distances you get for the different candidates: the lowest distance is the best, but it is still only a likely match. I think the best solution would be to present the user with the product names that the algorithm considers likely matches, and let the user make the final choice. But if you do not mind the occasional mismatch, you are on the right track: just pick a threshold distance below which you still consider two names a match, keeping in mind that it will give you some false positives.
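A sketch of that idea (the threshold of 10, the list names and the LevenshteinDistance helper are only placeholders):

const int Threshold = 10; // distances above this are not even shown as suggestions

foreach (var supplierName in theirNames)
{
    // rank your own products by distance and keep only the plausible candidates
    var candidates = myNames
        .Select(n => new { Name = n, Distance = LevenshteinDistance(supplierName, n) })
        .Where(c => c.Distance <= Threshold)
        .OrderBy(c => c.Distance)
        .Take(3);

    Console.WriteLine(supplierName + ":");
    foreach (var c in candidates)
        Console.WriteLine("  " + c.Name + " (distance " + c.Distance + ")");
    // the user picks the right product (or none) from this short list
}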

0
