"anderstornvig" mentioned Levenshtein / editing distance, which is a great idea, but not entirely appropriate, because some permutations are more significant than other permutations. The problem is that we use a lot of domain knowledge when we determine which differences are “significant” and which are “not significant”. For example, we know that the hyphen in "Half-Blood Prince" is very important, but the number in "Firefox 3" is very important.
For this reason, you might consider starting with a simple metric such as Levenshtein and adding hooks that let you tune which differences are important and which are irrelevant.
In particular, Levenshtein counts the number of "edits" (insertions, deletions, and substitutions) needed to turn one string into another, and it effectively weights every edit the same. You can write an implementation that weights certain edits differently. For example, changing "-" to "" (deleting the hyphen) should have a very low weight, marking it as unimportant; changing "3" to "2", with everything else left alone, should have a very high weight, marking it as significant.
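As a rough sketch, here is what a parameterized Levenshtein might look like in Python. The cost functions and the specific weights (0.1, 5.0, and so on) are hypothetical starting points for illustration, not a fixed recipe:

```python
PUNCTUATION = set("-.,:;'\" ")

def deletion_cost(ch):
    """Cost of dropping a character entirely (weights are assumptions)."""
    if ch in PUNCTUATION:
        return 0.1   # e.g. the hyphen in "Half-Blood Prince": nearly free
    if ch.isdigit():
        return 5.0   # dropping a digit changes the meaning a lot
    return 1.0

def substitution_cost(x, y):
    """Cost of replacing character x with character y."""
    if x.isdigit() and y.isdigit():
        return 5.0   # "Firefox 3" -> "Firefox 2" is a significant change
    if x in PUNCTUATION and y in PUNCTUATION:
        return 0.1   # hyphen vs. space: barely matters
    return 1.0

insertion_cost = deletion_cost  # assume inserting costs the same as deleting

def weighted_levenshtein(a, b):
    """Standard dynamic-programming edit distance, except each edit's
    weight comes from the cost functions above instead of always being 1."""
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + deletion_cost(a[i - 1])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + insertion_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]  # characters match: free
            else:
                dist[i][j] = min(
                    dist[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
                    dist[i - 1][j] + deletion_cost(a[i - 1]),
                    dist[i][j - 1] + insertion_cost(b[j - 1]),
                )
    return dist[m][n]
```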
By parameterizing the calculation, you create a framework for iterative improvement of your algorithm. Pick an initial configuration and run it against some test data. Find places where the metric performs poorly - where it merges two titles that you think should stay separate, for example - and adjust the parameters until you are satisfied.
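For instance, with the hypothetical weights above, the two kinds of change from the examples score very differently:

```python
# Hyphen vs. space: nearly identical titles, tiny distance.
print(weighted_levenshtein("Half-Blood Prince", "Half Blood Prince"))  # 0.1

# Different version number: small textual change, large distance.
print(weighted_levenshtein("Firefox 3", "Firefox 2"))                  # 5.0
```

If the metric merges titles that should stay separate, raise the weights on the edits that distinguish them; if it separates titles that should merge, lower them.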
In this way, you can train your algorithm with your domain knowledge.
adrian