Algorithm for measuring the similarity between two sequences of strings

How can I measure the similarity fraction between two string sequences?

I have two text files, and in the files the sequences are written as

First file:

AAA BBB DDD CCC GGG MMM AAA MMM

Second file:

BBB DDD CCC MMM AAA MMM

How to measure the similarity between these two files in terms of line order?

For example, in the above example, both files are similar due to the order of the lines, but some lines are missing from file-2. Which algorithm is best suited to solve this problem, so that I can measure how similar the order of the lines is, and not the frequency of the lines in two?

+7
source share
2 answers

You can use the Levenstein Distance algorithm. It analyzes how many changes are required to convert one line to another. This article explains this pretty well, and provides an example implementation.

Copy paste from Codeproject :

1. Set n to be the length of s. ("GUMBO") Set m to be the length of t. ("GAMBOL") If n = 0, return m and exit. If m = 0, return n and exit. Construct two vectors, v0[m+1] and v1[m+1], containing 0..m elements. 2. Initialize v0 to 0..m. 3. Examine each character of s (i from 1 to n). 4. Examine each character of t (j from 1 to m). 5. If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1. 6. Set cell v1[j] equal to the minimum of: a. The cell immediately above plus 1: v1[j-1] + 1. b. The cell immediately to the left plus 1: v0[j] + 1. c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost. 7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell v1[m]. 
+8
source

You can use the python function SequenceMatcher.ratio , which measures sequence similarity as a float in the range [0, 1] . If T is the total number of elements in both sequences, and M is the number of matches, this is 2.0 * M / T The main code is as follows:

 from difflib import SequenceMatcher text1 = 'AAA BBB DDD CCC GGG MMM AAA MMM' text2 = 'BBB DDD CCC MMM AAA MMM' s = SequenceMatcher(None, text1, text2) similarity = s.ratio() * 100 

Hope this helps you!

+6
source

All Articles