Algorithm for measuring the distance between disordered sequences

Question

Levenshtein distance gives us a way to calculate the distance between two similar strings in terms of disordered individual characters:

  quick brown fox
  quikc brown fax

Levenshtein distance = 3.
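
For reference, a minimal sketch of the standard dynamic-programming Levenshtein computation is shown here. The function name `levenshtein` is only illustrative; it counts single-character insertions, deletions, and substitutions, and reproduces the distance of 3 quoted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("quick brown fox", "quikc brown fax"))  # 3
```

The swapped "ck" costs two substitutions and "o" to "a" costs one, which is where the 3 comes from.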

What is a similar algorithm for the distance between two strings with similar subsequences? For example, for

  quickbrownfox
  brownquickfox

the Levenshtein distance is 10, but this does not take into account the fact that the strings share two identical subsequences ("quick" and "brown"), which makes them more “similar” than completely disordered words like

  quickbrownfox
  qburiocwknfox

and yet this completely disordered version has a Levenshtein distance of only 8.

What distance measures exist that take into account the length of subsequences without assuming that subsequences can easily be broken down into separate words?

algorithm

user181548 May 18, '10 at 11:17

5 answers

Manvel · Answer 1 · 2010-05-18T14:12:26+0000

I think you can try shingles (overlapping fixed-length substrings), or some combination of them with Levenshtein distance.
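
A rough sketch of the shingling idea, under some assumptions the answer does not specify: character shingles of length `k = 3`, compared with Jaccard distance. The helper names `shingles` and `jaccard_distance` are made up for this example.

```python
def shingles(s: str, k: int = 3) -> set:
    """Set of all overlapping substrings of length k (assumed shingle size)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # both strings shorter than k: treat as identical
    return 1.0 - len(sa & sb) / len(sa | sb)


original = "quickbrownfox"
print(jaccard_distance(original, "brownquickfox"))  # blocks swapped
print(jaccard_distance(original, "qburiocwknfox"))  # fully scrambled
```

With these settings the block-swapped string comes out closer to the original than the scrambled one, which is the ordering the question asks for. The choice of `k` is a tuning knob: small values behave more like a bag of characters, large values reward only long shared runs.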

mathmike · Answer 2 · 2010-05-19T06:20:22+0000

One simple metric is to take all n * (n + 1) / 2 contiguous substrings of each string (n * (n - 1) / 2 if single characters are excluded) and count how many the two strings have in common. There are simpler variants of this approach that look only at substrings of a certain length.

This is similar to the BLEU metric commonly used to evaluate machine translation. In the case of BLEU, two sentences are compared by taking all the unigrams, bigrams, trigrams, and 4-grams of words from each sentence. A version of precision and recall is computed for each n-gram length and, in essence, an average of these scores is used.
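
A minimal sketch of this kind of overlap score, applied to characters rather than words. This is not BLEU itself: it uses a plain arithmetic mean, has no brevity penalty, and the function names and the cap `max_n = 4` are assumptions made for illustration.

```python
from collections import Counter


def ngrams(s: str, n: int) -> Counter:
    """Multiset of all contiguous substrings of length n."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_overlap(a: str, b: str, max_n: int = 4) -> float:
    """Average fraction of shared character n-grams for n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        ca, cb = ngrams(a, n), ngrams(b, n)
        total = max(sum(ca.values()), sum(cb.values()))
        if total == 0:
            continue  # both strings are shorter than n
        shared = sum((ca & cb).values())  # multiset intersection
        scores.append(shared / total)
    return sum(scores) / len(scores) if scores else 0.0


original = "quickbrownfox"
print(ngram_overlap(original, "brownquickfox"))  # blocks swapped
print(ngram_overlap(original, "qburiocwknfox"))  # fully scrambled
```

This is a similarity (higher means closer); `1 - ngram_overlap(a, b)` can serve as a distance. Because both example strings are anagrams of the original, the unigram term is identical for them and the longer n-grams are what separate the two cases.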

jk. · Answer 3 · 2010-05-18T11:29:35+0000

First stab: run a diff algorithm and use the number of differences as the distance.
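
One way to try this, sketched with Python's standard `difflib` (the answer does not name a particular diff tool, so this choice and the helper names are assumptions). `diff_op_count` counts the non-matching regions reported by the matcher, and `diff_distance` turns the matcher's similarity ratio into a distance.

```python
from difflib import SequenceMatcher


def diff_op_count(a: str, b: str) -> int:
    """Number of replace/insert/delete regions needed to turn a into b."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return sum(1 for tag, *_ in sm.get_opcodes() if tag != "equal")


def diff_distance(a: str, b: str) -> float:
    """1 - ratio(), where ratio() is 2 * matched chars / total chars."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return 1.0 - sm.ratio()


original = "quickbrownfox"
for other in ("brownquickfox", "qburiocwknfox"):
    print(other, diff_op_count(original, other), diff_distance(original, other))
```

A caveat worth checking on your own data: a plain diff aligns the two strings in order, so a block that has merely moved typically still shows up as a deletion plus an insertion. It may therefore not rank the block-swapped string as closer than the scrambled one.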

Roman · Answer 4 · 2010-05-18T14:57:54+0000

My impression is that this is an NP-complete problem.

At least, I don’t see how we can avoid an exhaustive search. Moreover, I don’t even see how we could verify a solution in polynomial time.

Laz · Answer 5 · 2010-05-19T06:29:21+0000

Well, the problem you describe falls under context-sensitive grammars. In this case, you essentially define a grammar, here the grammar of the English language, and then measure the distance between the grammar and the mismatching input. You first need to parse your input.
