Absolute String Metric

I have a huge (but finite) set of natural language strings.

I need a way to convert each string to a numeric value. For any given string, the value must be the same every time.

The more "different" two given strings are, the more different their corresponding values should be. The more "similar" they are, the closer the values should be.

I don't yet know exactly which definition of difference between strings I need. No natural language parsing, in any case. It should probably be something Levenshtein-like (but Levenshtein is relative, and I need an absolute metric). Let's start with something simple.

Update regarding dimensions

I would happily settle for a multidimensional (ideally 3D) vector instead of a single numeric value.

Update regarding the correctness of the expected result

As has been rightly noted here and here, the distance from one string to another is a vector with MAX(firstStringLength, secondStringLength) dimensions. In the general case, it is impossible to reduce the number of dimensions without losing information.

However, I do not need an absolute solution. I would settle for any "good enough" conversion from the N-dimensional string space to my 3D space.

Note also that I have a finite number of strings of finite length. (The number of strings is quite large, though, about 80 million (10 GB), so I'd prefer a single-pass, stateless algorithm.)

, . ...

  • ​​N- , N - . BTW, i- i- ?
  • N- .
  • , . ( ) - , .
  • 3D-, 3D , , .

? ?

+5
8

, . ( , )

  • "Hello World" = 0

2 :

  • "XXllo World" = a
  • "HeXXo World" = b
  • "Hello XXrld" = c
  • "Hello WorXX" = d

, 4 . , , :

a = 1, b = -1, c = 2, d = -2

, c to 0 2, c to a 1, 0 , a.

.

+5

, , ?

, , , . , , . , , "cat", , "bat", "hat", "rat", "can", "cot" .. , , , . " " " " , "" . , , , , , . , , "" , .

, : -, , -, , , , Levenstein ? , , , , . , , .

: , , , . f ( "cat" ) = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22). , , , , . , , , (, ), . , S .
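A minimal sketch of the worked example f("cat") = (3, 24, 22), assuming the three components are read as the string length, the sum of the letter positions in the alphabet (c = 3, a = 1, t = 20), and the alternating sum of those positions. The names `letter_value` and `f` and the choice to ignore non-letters are my assumptions, not something the answer specifies.

```python
def letter_value(ch):
    """Position of a letter in the English alphabet: a = 1, ..., z = 26."""
    return ord(ch.lower()) - ord("a") + 1


def f(s):
    """Map a string to (length, sum of letter values, alternating sum).

    Assumption: characters outside a-z are skipped entirely.
    """
    values = [letter_value(ch) for ch in s if "a" <= ch.lower() <= "z"]
    total = sum(values)
    # Signs alternate +, -, +, ... starting with the first letter.
    alternating = sum(v if i % 2 == 0 else -v for i, v in enumerate(values))
    return (len(values), total, alternating)


print(f("cat"))  # (3, 24, 22)
```

It is a single pass per string and keeps no state, but anagrams share the first two coordinates, so only the alternating sum tells "cat" from "act".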

+3

, .

: , " " , ( , ). N- , :

distance projected onto X: (x,y,z).(1,0,0) = x

, , , :

(30,0,0).(1/3,1/3,1/3) = (0,30,0).(1/3,1/3,1/3) = (0,0,30).(1/3,1/3,1/3) = 10
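The two projections above can be checked with a few lines of Python. This is only an illustration of the dot products as written; `dot` is a helper name I made up, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


third = Fraction(1, 3)
diagonal = (third, third, third)

# Projecting onto the X axis keeps only the x component.
x_only = dot((7, 8, 9), (1, 0, 0))

# Three different points collapse to the same value on the diagonal.
points = [(30, 0, 0), (0, 30, 0), (0, 0, 30)]
projections = [dot(p, diagonal) for p in points]

print(x_only, projections)
```

All three points project to 10, which is the information loss the projection implies.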

, : , , Component Analysis, , . (.. , ).

3- , , PCA :

"acegikmoqsuwy" //use half your permitted symbols then repeat until you have a string of size equal to your longest string.
"bdfhjlnprtv" //use the other half then repeat as above.
"" //The empty string, this will just give you the length of the string, so a cheap one.
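A rough sketch of how the three reference strings above might be used, assuming each coordinate is the Levenshtein distance from the input to one reference string. The function names and the `max_len` parameter (the length of your longest string) are assumptions for illustration.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def repeat_to_length(pattern, length):
    """Repeat pattern until it is exactly `length` characters long."""
    return (pattern * (length // len(pattern) + 1))[:length]


def to_3d(s, max_len):
    """Distances from s to the three reference strings (assumed usage)."""
    ref_a = repeat_to_length("acegikmoqsuwy", max_len)
    ref_b = repeat_to_length("bdfhjlnprtv", max_len)
    # Distance to the empty string is simply len(s).
    return (levenshtein(s, ref_a), levenshtein(s, ref_b), levenshtein(s, ""))


print(to_3d("hello world", 20))
```

Each string is processed independently of the others, so this fits the one-pass, stateless requirement, though each distance costs O(len(s) * max_len).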

, , / : http://www.springer.com/mathematics/geometry/book/978-3-642-00233-5

Levenshtein: http://www.merriampark.com/ld.htm

+3

FryGuy , . aaaaaaaaaa baaaaaaaaa, abaaaaaaaa,..., aaaaaaaaab. 10, . 10 b - aaaaaaaaaa 1, 2. , N 2- , N- .

, .

+2

, , "1", , / , (etaoinshrdlu...), , , + .

+1

, .

- . ( "a", "b", "c", "t" ) 3, (a: 1, b: 1, c: 1, t: 1,..., a: 3, b: 3, c: 3, t: 3)

"cat" (0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1).
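The "cat" vector above can be reproduced with a short sketch, assuming one dimension per (position, letter) pair, ordered position by position over the alphabet ("a", "b", "c", "t"). The function name `one_hot_positions` is made up for the example.

```python
def one_hot_positions(s, alphabet, max_len):
    """Encode s as a 0/1 vector with len(alphabet) slots per position.

    Assumption: every character of s is in `alphabet`; an unknown
    character makes `alphabet.index` raise ValueError.
    """
    vec = [0] * (len(alphabet) * max_len)
    for pos, ch in enumerate(s[:max_len]):
        vec[pos * len(alphabet) + alphabet.index(ch)] = 1
    return vec


print(one_hot_positions("cat", "abct", 3))
# [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
```

With this encoding, two strings of equal length that differ in k positions differ in 2k vector components.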

, , , (, SVD), . , . .

. SVD, , .

+1

" ", , , .

, "Origin". , .

, , , , .

0

This is an off-the-top-of-my-head answer to the question.

Basically, this calculates how far sentence 2 is from sentence 1 as the Cartesian distance from sentence 1 (which is assumed to be at the origin), where the distance along each axis is the minimum Levenshtein distance between a word in sentence 1 and all the words in sentence 2. It has the property that two equal sentences give a distance of 0.

If this approach has been published elsewhere, I am not aware of it.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "The cat sat on the mat";
            string str2 = "The quick brown fox jumped over the lazy cow";
            ReportDifference(str1, str1);
            ReportDifference(str2, str2);
            ReportDifference(str1, str2);
            ReportDifference(str2, str1);
        }
        /// <summary>
        /// Quick test and display routine
        /// </summary>
        /// <param name="str1">First sentence to test with</param>
        /// <param name="str2">Second sentence to test with</param>
        static void ReportDifference(string str1, string str2)
        {
            Debug.WriteLine(
                String.Format("difference between \"{0}\" and \"{1}\" is {2}", 
                str1, str2, Difference(str1, str2))); 
        }
        /// <summary>
        /// This does the hard work.
        /// Basically, what it does is:
        /// 1) Split the strings into tokens/words
        /// 2) Form a cartesian product of the 2 lists of words.
        /// 3) Calculate the Levenshtein distance between each pair of words.
        /// 4) Group on the words from the first sentence
        /// 5) Get the min distance between the word in the first sentence and all of the words from the second
        /// 6) Square the distances for each word.
        ///     (based on the distance between 2 points being the sqrt of the sum of the squared x,y,... axis distances;
        ///     what this assumes is that the first sentence is the origin)
        /// 7) Take the square root of the sum
        /// </summary>
        /// <param name="str1">sentence 1 compare</param>
        /// <param name="str2">sentence 2 compare</param>
        /// <returns>distance calculated</returns>
        static double Difference(string str1, string str2)
        {
            string[] splitters = { " " };

            var a = Math.Sqrt(
                (from x in str1.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     from y in str2.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     select new {x, y, ld = Distance.LD(x,y)} )
                    .GroupBy(x => x.x)
                    .Select(q => new { q.Key, min_match = q.Min(p => p.ld) })
                    .Sum(s =>  (double)(s.min_match * s.min_match )));
            return a;
        }
    }

    /// <summary>
    /// Lifted from http://www.merriampark.com/ldcsharp.htm
    /// </summary>
    public class Distance
    {

        /// <summary>
        /// Compute Levenshtein distance
        /// </summary>
        /// <param name="s">String 1</param>
        /// <param name="t">String 2</param>
        /// <returns>Distance between the two strings.
        /// The larger the number, the bigger the difference.
        /// </returns>
        public static int LD(string s, string t)
        {
            int n = s.Length; //length of s
            int m = t.Length; //length of t
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost
            // Step 1
            if (n == 0) return m;
            if (m == 0) return n;
            // Step 2
            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;
            // Step 3
            for (int i = 1; i <= n; i++)
            {
                //Step 4
                for (int j = 1; j <= m; j++)
                {
                    // Step 5
                    cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                    // Step 6
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }
            // Step 7
            return d[n, m];
        }
    }
}
-1