Before comparing names or measuring the distance between them, you must normalize them.
Normalization should include things like:
- Basic formatting (e.g. UTF-16 encoding, no spaces or tabs)
- Alphabet rules (e.g. replace Ä with A)
- Acronym expansion (e.g. NY → New York)
- Rules for geographical names (for example, city names should contain dashes rather than spaces)
- Capitalization rules (for example, each letter following a dash must be capitalized)
- Removing punctuation characters (e.g. !, ?)
- Number conversion (for example, "three hundred" to "300")
- Roman numeral conversion (for example, "Louis XVI" to "Louis 16")
- Non-American English to American English (for example, "colour" to "color")
- Abbreviation rules (for example, Inc. instead of Incorporated, vs. instead of versus)
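
A few of the rules above could be sketched in Python roughly like this. It is only a minimal illustration, not a complete normalizer: the `REPLACEMENTS` table and the function names are made up for the example, only upper-case Roman numerals are converted, and it returns a list of lower-case words instead of a dash-joined string.

```python
import re
import unicodedata

# Hypothetical lookup table (an assumption for illustration);
# real data would need a much larger one.
REPLACEMENTS = {
    "ny": "new york",
    "incorporated": "inc",
    "versus": "vs",
    "colour": "color",
}

# Strict pattern for upper-case Roman numerals (1 to 3999).
ROMAN_RE = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(token):
    """Convert an upper-case Roman numeral to an int, or return None."""
    if not token or not ROMAN_RE.match(token):
        return None
    total = 0
    for i, ch in enumerate(token):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        if i + 1 < len(token) and value < ROMAN_VALUES[token[i + 1]]:
            total -= value
        else:
            total += value
    return total


def strip_accents(text):
    """Replace accented letters with their base letter (Ä -> A)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(name):
    """Return the name as a list of normalized, lower-case words."""
    text = strip_accents(name)
    # Turn punctuation (anything but letters, digits, whitespace) into spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = []
    for token in text.split():
        number = roman_to_int(token)
        if number is not None:
            tokens.append(str(number))
            continue
        token = token.lower()
        # An expansion may yield several words ("ny" -> "new", "york").
        tokens.extend(REPLACEMENTS.get(token, token).split())
    return tokens
```

For example, `normalize_name("Louis XVI")` gives `['louis', '16']`. Note that this naive Roman numeral check also converts upper-case words such as "I" or "MIX", so real data would need extra context rules.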
You can then use the Levenshtein distance between pairs of words (rather than over the whole sentence), but implement some kind of sliding window, since some words (for example, "The") may be missing from one of the representations.
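
One possible way to do the word-pair comparison with a sliding window is sketched below. This is just one interpretation, under my own assumptions: the function names and the default `window` size are invented for the example, each word of the shorter name is matched against nearby positions in the longer name, and unmatched extra words in the longer name are simply ignored.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def windowed_distance(words_a, words_b, window=1):
    """Sum, over the shorter word list, of each word's best match
    within a sliding window of positions in the longer list."""
    short, long_ = sorted((words_a, words_b), key=len)
    total = 0
    for i, word in enumerate(short):
        # Look at positions i - window .. i + window in the longer list.
        lo = max(0, i - window)
        hi = min(len(long_), i + window + 1)
        total += min(levenshtein(word, other) for other in long_[lo:hi])
    return total
```

With this, `windowed_distance(["the", "rolling", "stones"], ["rolling", "stones"])` is 0, because the missing "the" only shifts the other words by one position. Whether extra words should instead add a penalty depends on your data.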