What is the most suitable string distance algorithm that can be used to compare TV show names?

I am writing a scraper for TV shows and other media (games, films, etc.), and not all sources format a given show's name the same way. For example, one source may separate subtitles with dashes, another with semicolons. I am currently using Levenshtein distance to compare the scraped data with data extracted from a TV show's file name, but I am not sure it is the right choice for short strings such as titles, as opposed to long sentences. Is there an algorithm better suited to this need?

1 answer

Before comparing names or measuring the distance between them, you should normalize them.

Normalization should include things like:

  • Basic formatting (e.g. a consistent encoding such as UTF-16, no stray spaces or tabs)
  • Alphabet rules (e.g. replace Ä with A)
  • Acronym expansion (e.g. NY → New York)
  • Rules for geographical names (for example, city names should contain dashes rather than spaces)
  • Capitalization rules (for example, each letter following a dash must be capitalized)
  • Removing punctuation characters (e.g. !, ?)
  • Number conversion (for example, "three hundred" to "300")
  • Roman numeral conversion (for example, "Louis XVI" to "Louis 16")
  • British versus American English (for example, "colour" to "color")
  • Abbreviation rules (for example, Inc. instead of Incorporated, vs. instead of versus)
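
A few of these rules can be sketched in Python as follows. This is only a minimal illustration, not a complete normalizer: the `REPLACEMENTS` table and the function names are hypothetical, Roman numerals are only handled up to 39 (enough for most sequel and season numbers), and converting number words such as "three hundred" is left out.

```python
import re
import unicodedata

# Hypothetical lookup table; a real scraper would need a much larger one.
REPLACEMENTS = {
    "ny": "new york",       # acronym expansion
    "incorporated": "inc",  # abbreviation rule
    "versus": "vs",         # abbreviation rule
    "colour": "color",      # British -> American spelling
}

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10}
# Assumption: only Roman numerals from 1 to 39 are recognized.
ROMAN_PATTERN = re.compile(r"x{0,3}(ix|iv|v?i{0,3})")


def roman_to_int(word):
    """Convert a lowercase Roman numeral made of i, v and x to an integer."""
    total = 0
    for index, char in enumerate(word):
        value = ROMAN_VALUES[char]
        following = ROMAN_VALUES[word[index + 1]] if index + 1 < len(word) else 0
        # A smaller value in front of a larger one is subtracted (iv = 4).
        total += -value if following > value else value
    return total


def normalize_title(title):
    """Return a lowercase, punctuation-free, accent-free version of a title."""
    # Decompose accented letters and drop the combining marks (Ä -> A).
    decomposed = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then collapse punctuation, underscores and whitespace
    # runs into single spaces.
    text = re.sub(r"[\W_]+", " ", text.casefold()).strip()
    words = []
    for word in text.split():
        word = REPLACEMENTS.get(word, word)
        # Single letters are left alone so that "I" or "X" stay words.
        if len(word) > 1 and ROMAN_PATTERN.fullmatch(word):
            word = str(roman_to_int(word))
        words.append(word)
    return " ".join(words)
```

With this sketch, `normalize_title("Star Trek: The Next Generation")` and `normalize_title("Star_Trek - The Next Generation")` produce the same string, so separator differences between sources disappear before any distance is computed. Because both sides go through the same function, an occasional wrong conversion (an ordinary word that happens to look like a Roman numeral, for instance) affects both names equally.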

You can then use the Levenshtein distance between pairs of words rather than over the whole title, combined with some kind of sliding window, because some words (for example, "The") may be missing from one of the representations.
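
One possible reading of this suggestion is sketched below in Python. It assumes both names are already normalized and lowercased. The function names, the `window` parameter, and the choice to average per-word distances are assumptions made for illustration, not something the answer prescribes.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def title_distance(name_a, name_b, window=1):
    """Average per-word distance between two names, from 0.0 to 1.0.

    Each word of the shorter name is compared with the words of the
    longer name whose position is at most `window` places away, and the
    best match is kept. This tolerates a missing word such as "the".
    """
    short, long_ = sorted((name_a.split(), name_b.split()), key=len)
    if not short:
        # Two empty names are identical; an empty name matches nothing else.
        return 0.0 if not long_ else 1.0
    total = 0.0
    for i, word in enumerate(short):
        low = max(0, i - window)
        high = min(len(long_), i + window + 1)
        # Normalize each word distance by the longer word's length.
        total += min(
            levenshtein(word, candidate) / max(len(word), len(candidate))
            for candidate in long_[low:high]
        )
    return total / len(short)
```

For example, `title_distance("the office", "office")` is 0.0 because "office" finds an exact match inside the window, while a one-letter typo in a two-word title such as `title_distance("star trek", "star trak")` gives 0.125. Note that this sketch does not penalize extra words in the longer name; whether to add such a penalty depends on how strict the matching needs to be.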
