I'm having problems with the Levenshtein distance algorithm.
I use the Levenshtein distance algorithm to compare a product name against a list of product names and find the closest match. However, I need to tweak it a bit. I am using an example from dotnetperls.com.
Let's say I have a list A of 2000 product names from my own database. I sell all these products myself.
Then all of a sudden I get list B from one of my suppliers, with product names and a new price for each product. This can happen more than once a year, so I want to develop software that does the matching for me instead of doing it manually.
The problem is that this supplier is not very consistent. He makes small changes to the names from time to time, which means that I cannot perform a simple string comparison.
I implemented the distance algorithm, but it does not quite fit my needs yet.
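For reference, the plain approach boils down to something like the following minimal Python sketch. This is not the dotnetperls.com code, and the names `levenshtein` and `closest_match` are made up for illustration; it only assumes unit cost for every insertion, deletion, and substitution, and a case-insensitive comparison.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: each insert, delete, or substitute costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_match(name: str, candidates: List[str]) -> Tuple[str, int]:
    """Return the candidate with the smallest distance to `name` (hypothetical helper)."""
    best = min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))
    return best, levenshtein(name.lower(), best.lower())
```

Every character counts the same here, so an extra or reordered word such as "Shampoo" costs as many edits as it has letters, no matter how little that word says about the product.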
Looking through my supplier's list, I came across a product called
American anti-dandruff shampoo against 250 ml
This product was correctly matched with my own product called
American crew against dandruff 250 ml.
with a distance of 10.
Problem
I also came across a product called
American Crew 3-In-1 Shampoo 450 ml.
which was mistakenly matched with
Daily shampoo American Crew Daily 450 ml.
instead of my own product
American crew 3 in 1,450 ml.
And I understand why! But I'm not sure how I can change the algorithm here.
Any ideas?
By the way, I'm not very good at algorithms, but I think that some kind of weighting would help me here.
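One way such a weighting could look, as a rough sketch only: compare whole words instead of characters, and let words that are rare in my own list count more than words that appear in almost every name. Note that this swaps plain Levenshtein for a weighted word overlap (a weighted Jaccard similarity). The helper names `tokens`, `token_weights`, `weighted_similarity`, and `best_match`, as well as the smoothed inverse-document-frequency weight, are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokens(name):
    """Lower-case the name and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def token_weights(catalogue):
    """Build a weight function: words found in few catalogue names weigh more."""
    df = Counter(tok for name in catalogue for tok in tokens(name))
    n = len(catalogue)
    # Smoothed inverse document frequency; unseen words get the highest weight.
    return lambda tok: math.log((n + 1) / (df[tok] + 1)) + 1.0


def weighted_similarity(a, b, weight):
    """Weighted Jaccard: weight of shared words divided by weight of all words."""
    ta, tb = tokens(a), tokens(b)
    total = sum(weight(t) for t in ta | tb)
    if total == 0:
        return 0.0  # both names contain no usable words
    return sum(weight(t) for t in ta & tb) / total


def best_match(name, catalogue):
    """Return the catalogue name with the highest weighted similarity."""
    weight = token_weights(catalogue)
    return max(catalogue, key=lambda c: weighted_similarity(name, c, weight))


my_products = [
    "Daily shampoo American Crew Daily 450 ml",
    "American crew 3 in 1,450 ml",
    "American crew against dandruff 250 ml",
]
match = best_match("American Crew 3-In-1 Shampoo 450 ml", my_products)
```

With these three names, `match` comes out as the "3 in 1" product, because the words "3", "in", and "1" are rare while "american", "crew", and "ml" appear everywhere. The trade-off is that exact word matching no longer tolerates typos inside a word, so in practice it would probably have to be combined with a per-word edit distance.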
EDIT:
Computing time is not really a concern. Even if it takes ten hours to complete, it's still a lot better than doing it manually :P