Comparing multi-word names with Levenshtein distance

Question

I'm comparing the names of buildings on my campus against entries from various databases. The names were entered by hand, and each person used their own abbreviation scheme. I'm trying to find the best match between a user's input and the canonical form of the name.

I have implemented a recursive Levenshtein distance method, but there are a few edge cases that I am trying to solve. My implementation is available on GitHub.

Some building names are one word, while others are two. Comparing a single word against a single word gives fairly accurate results, but there are two edge cases that I need to keep in mind.

Abbreviations: Assuming the input is an abbreviated version of a name, I can sometimes get the same Levenshtein distance between the input and an arbitrary name as between the input and the correct name. For example, if my input is "Ing" and the building names ^1 are ["Boylan", "Ingersoll", "Whitman", "Whitehead", "Roosevelt", "Library"], I get an LD of 6 for both Boylan and Ingersoll. The desired result here is Ingersoll.
Multi-word strings: The second edge case is when the input and/or the output are two words separated by a space. For example, New Ing is an abbreviation for New Ingersoll. In this case, New Ingersoll and Boylan both score a Levenshtein distance of 6. If I split the strings, New matches New perfectly, and then I just need to fall back to solving my previous edge case.
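
The tie described in the two cases above is easy to reproduce. The following is a minimal sketch in plain C (which also compiles inside an Objective-C source file), not the implementation from the question. The function name `levenshtein` is made up for illustration, and it compares bytes, so it assumes ASCII input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Classic Levenshtein distance over bytes (ASCII input assumed),
 * keeping two rolling rows instead of the full matrix. */
size_t levenshtein(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    assert(prev != NULL && curr != NULL);

    /* Turning the empty prefix of a into b[0..j) takes j insertions. */
    for (size_t j = 0; j <= lb; j++) prev[j] = j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = i; /* i deletions to reach the empty prefix of b */
        for (size_t j = 1; j <= lb; j++) {
            size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            size_t del = prev[j] + 1;
            size_t ins = curr[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            curr[j] = best < ins ? best : ins;
        }
        size_t *tmp = prev; prev = curr; curr = tmp;
    }

    size_t result = prev[lb];
    free(prev);
    free(curr);
    return result;
}
```

With this sketch, `levenshtein("Ing", "Ingersoll")` and `levenshtein("Ing", "Boylan")` both return 6, and so do `levenshtein("New Ing", "New Ingersoll")` and `levenshtein("New Ing", "Boylan")`, which is exactly the ambiguity in question.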

What is the best way to handle these two edge cases?

1. For the curious, these are buildings at Brooklyn College in New York.

string algorithm objective-c levenshtein distance

Moshe Nov 06 '14 at 19:36

2 answers

, , , . , , . ( "Ing" "Boylan" , . "Ing" "Boylan", , , .) , , "" "" , .

, . , , , . , . .

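How would you decide between Ingersoll and New Ingersoll? Say each matched word scores 100, and the total is divided by the number of words in the input. Each word of the candidate name that is left unmatched scores -100. For example:

Scores against "Ingersoll":

"Ing"       100 / 1 == 100
"New Ing"   100 / 2 == 50

Scores against "New Ingersoll":

"Ing"       (100 - 100) / 1 == 0
"New Ing"   (100 + 100) / 2 == 100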

, , , . "NI" "NIng" New Ingersoll, , , , .
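
The word-level arithmetic above is consistent with a scheme that awards +100 for each input word that matches a candidate word, subtracts 100 for each candidate word left unmatched, and divides by the number of input words. The sketch below is one possible reading of that arithmetic in plain C, not the answerer's code. The names `word_score`, `split_words`, `is_prefix`, and `MAX_WORDS` are hypothetical, and the match rule (case-insensitive prefix) is an assumption made only so that "Ing" matches "Ingersoll". Under this reading, (100 - 100) / 1 evaluates to 0.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS 8 /* hypothetical limit, enough for building names */

/* Split text on spaces, recording where each word starts and how long it is.
 * Returns the number of words found (at most MAX_WORDS). */
static size_t split_words(const char *text, const char *starts[], size_t lens[])
{
    size_t count = 0;
    const char *p = text;
    while (*p != '\0' && count < MAX_WORDS) {
        while (*p == ' ') p++; /* skip separators */
        if (*p == '\0') break;
        starts[count] = p;
        while (*p != '\0' && *p != ' ') p++;
        lens[count] = (size_t)(p - starts[count]);
        count++;
    }
    return count;
}

/* Assumed match rule: word is a case-insensitive prefix of full. */
static int is_prefix(const char *word, size_t wlen, const char *full, size_t flen)
{
    if (wlen > flen) return 0;
    for (size_t i = 0; i < wlen; i++) {
        if (tolower((unsigned char)word[i]) != tolower((unsigned char)full[i]))
            return 0;
    }
    return 1;
}

/* +100 per matched input word, -100 per unmatched candidate word,
 * divided by the number of input words. */
int word_score(const char *input, const char *candidate)
{
    const char *in_start[MAX_WORDS], *cand_start[MAX_WORDS];
    size_t in_len[MAX_WORDS], cand_len[MAX_WORDS];
    int cand_used[MAX_WORDS] = {0};

    size_t n_in = split_words(input, in_start, in_len);
    size_t n_cand = split_words(candidate, cand_start, cand_len);
    assert(n_in > 0); /* an empty input has no meaningful score */

    int total = 0;
    for (size_t i = 0; i < n_in; i++) {
        for (size_t j = 0; j < n_cand; j++) {
            if (!cand_used[j] &&
                is_prefix(in_start[i], in_len[i], cand_start[j], cand_len[j])) {
                cand_used[j] = 1; /* each candidate word matches at most once */
                total += 100;
                break;
            }
        }
    }
    for (size_t j = 0; j < n_cand; j++) {
        if (!cand_used[j]) total -= 100;
    }
    return total / (int)n_in;
}
```

Under these assumptions, `word_score("Ing", "Ingersoll")` is 100, `word_score("New Ing", "Ingersoll")` is 50, `word_score("Ing", "New Ingersoll")` is 0, and `word_score("New Ing", "New Ingersoll")` is 100.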

( , , ).

M Oehm Nov 06 '14 at 20:54

Josh Caswell · Accepted Answer · 2014-11-06T22:41:31+0000

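I think you should use the length of the longest common subsequence instead of the Levenshtein distance. It seems like a better metric for your case. In essence, it prioritizes insertions and deletions over substitutions, as I suggested in my comment.

It gives the expected ranking for "Ing" → "Ingersoll" versus "Ing" → "Boylan" (scores of 3 and 1), and it handles spaces without any special treatment ("New Ing" → "New Ingersoll" scores 7, while "New Ing" → "Boylan" scores 1). It also copes with abbreviations that drop letters from the middle of a name, such as "Ingsll".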

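The algorithm is simple. Given two strings of lengths m and n, build a matrix of scores with dimensions m + 1 by n + 1, with the first row and first column set to zero. Then walk the matrix cell by cell: when the two characters for a cell match, its score is one more than the score of the cell diagonally up and to the left; otherwise it is the larger of the scores of the cell above and the cell to the left. Once the matrix is filled, the bottom-right cell holds the length of the LCS.

The matrix for "Ingsll" and "Ingersoll":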

      0 1 2 3 4 5 6
        I n g s l l
    ---------------
0   | 0 0 0 0 0 0 0
1 I | 0 1 1 1 1 1 1
2 n | 0 1 2 2 2 2 2
3 g | 0 1 2 3 3 3 3
4 e | 0 1 2 3 3 3 3
5 r | 0 1 2 3 3 3 3
6 s | 0 1 2 3 4 4 4
7 o | 0 1 2 3 4 4 4
8 l | 0 1 2 3 4 5 5
9 l | 0 1 2 3 4 5 6
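
The matrix above can be reproduced with a few lines of plain C. This is a minimal sketch, assuming ASCII input compared byte by byte. The function name `lcs_length` is made up for illustration, and the matrix is stored as one flat array indexed as `row * cols + col`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length of the longest common subsequence of two byte strings.
 * Rows follow target and columns follow source, as in the matrix above. */
size_t lcs_length(const char *source, const char *target)
{
    assert(source != NULL && target != NULL);
    size_t cols = strlen(source) + 1;
    size_t rows = strlen(target) + 1;

    /* calloc zero-fills, which provides row 0 and column 0 for free. */
    size_t *scores = calloc(rows * cols, sizeof *scores);
    assert(scores != NULL);

    for (size_t r = 1; r < rows; r++) {
        for (size_t c = 1; c < cols; c++) {
            if (target[r - 1] == source[c - 1]) {
                /* Match: extend the subsequence found diagonally up-left. */
                scores[r * cols + c] = scores[(r - 1) * cols + (c - 1)] + 1;
            } else {
                /* No match: carry over the better of "above" and "left". */
                size_t up = scores[(r - 1) * cols + c];
                size_t left = scores[r * cols + (c - 1)];
                scores[r * cols + c] = up > left ? up : left;
            }
        }
    }

    size_t result = scores[(rows - 1) * cols + (cols - 1)];
    free(scores);
    return result;
}
```

For example, `lcs_length("Ingsll", "Ingersoll")` returns 6, the value in the bottom-right cell of the matrix.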

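Here is an ObjC implementation. Most of the complexity comes from handling composed character sequences, such as @"o̶", correctly.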

#import <Foundation/Foundation.h>

@interface NSString (WSSComposedLength)

- (NSUInteger)WSSComposedLength;

@end

@implementation NSString (WSSComposedLength)

- (NSUInteger)WSSComposedLength
{
    __block NSUInteger length = 0;
    [self enumerateSubstringsInRange:(NSRange){0, [self length]}
                             options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired
                          usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
                              length++;
                          }];

    return length;
}

@end


@interface NSString (WSSLongestCommonSubsequence)

- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;

@end

@implementation NSString (WSSLongestCommonSubsequence)

- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target
{
    NSUInteger * const * scores;
    scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];

    return scores[[target WSSComposedLength]][[self WSSComposedLength]];
}

- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target
{
    NSUInteger * const * scores;
    scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];

    //FIXME: Implement this.

    return nil;
}

- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{

    NSUInteger selfLength = [self WSSComposedLength];
    NSUInteger targetLength = [target WSSComposedLength];
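    /* One buffer holds the row-pointer table followed by the cells of every
     * row. Allocating each row as its own temporary NSMutableData would leave
     * dangling pointers once those objects are released.
     */
    NSUInteger rowCount = targetLength + 1;
    NSUInteger colCount = selfLength + 1;
    NSMutableData * scoresData = [NSMutableData dataWithLength:rowCount * sizeof(NSUInteger *) +
                                                               rowCount * colCount * sizeof(NSUInteger)];
    NSUInteger ** scores = [scoresData mutableBytes];
    NSUInteger * cells = (NSUInteger *)(scores + rowCount);

    for( NSUInteger i = 0; i < rowCount; i++ ){
        scores[i] = cells + i * colCount;
    }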

    /* Ranges in the enumeration Block use the same measure as
     * -[NSString length] -- i.e., 16-bit code units -- as opposed to
     * _composed_ length, which counts composed character sequences. Thus:
     *
     * Enumeration will miss the last character if composed length is used
     * as the range and there is a substring range with length greater than one.
     */
    NSRange selfFullRange = (NSRange){0, [self length]};
    NSRange targetFullRange = (NSRange){0, [target length]};
    /* Have to keep track of these indexes by hand, rather than using the
     * Block substringRange.location because, e.g., @"o̶", will have
     * range {x, 2}, and the next substring will be {x+2, l}.
     */
    __block NSUInteger col = 0;
    __block NSUInteger row = 0;
    [target enumerateSubstringsInRange:targetFullRange
                             options:NSStringEnumerationByComposedCharacterSequences
                          usingBlock:^(NSString * targetSubstring,
                                       NSRange targetSubstringRange,
                                       NSRange _, BOOL * _0)
        {
            row++;
            col = 0;

            [self enumerateSubstringsInRange:selfFullRange
                                     options:NSStringEnumerationByComposedCharacterSequences
                                  usingBlock:^(NSString * selfSubstring,
                                               NSRange selfSubstringRange,
                                               NSRange _, BOOL * _0)
                {
                    col++;
                    NSUInteger newScore;
                    if( [selfSubstring isEqualToString:targetSubstring] ){

                        newScore = 1 + scores[row - 1][col - 1];
                    }
                    else {
                        NSUInteger upperScore = scores[row - 1][col];
                        NSUInteger leftScore = scores[row][col - 1];
                        newScore = MAX(upperScore, leftScore);
                    }

                    scores[row][col] = newScore;
                }];
        }];

    return scoresData;
}

@end

int main(int argc, const char * argv[])
{

    @autoreleasepool {

        NSArray * testItems = @[@{@"source" : @"Ingso̶ll",
                                  @"targets": @[
                                    @{@"string"   : @"Ingersoll",
                                      @"score"    : @6,
                                      @"sequence" : @"Ingsll"},
                                    @{@"string"   : @"Boylan",
                                      @"score"    : @1,
                                      @"sequence" : @"n"},
                                    @{@"string"   : @"New Ingersoll",
                                      @"score"    : @6,
                                      @"sequence" : @"Ingsll"}]},
                                @{@"source" : @"Ing",
                                  @"targets": @[
                                         @{@"string"   : @"Ingersoll",
                                           @"score"    : @3,
                                           @"sequence" : @"Ing"},
                                         @{@"string"   : @"Boylan",
                                           @"score"    : @1,
                                           @"sequence" : @"n"},
                                         @{@"string"   : @"New Ingersoll",
                                           @"score"    : @3,
                                           @"sequence" : @"Ing"}]},
                                @{@"source" : @"New Ing",
                                  @"targets": @[
                                         @{@"string"   : @"Ingersoll",
                                           @"score"    : @3,
                                           @"sequence" : @"Ing"},
                                         @{@"string"   : @"Boylan",
                                           @"score"    : @1,
                                           @"sequence" : @"n"},
                                         @{@"string"   : @"New Ingersoll",
                                           @"score"    : @7,
                                           @"sequence" : @"New Ing"}]}];

        for( NSDictionary * item in testItems ){
            NSString * source = item[@"source"];
            for( NSDictionary * target in item[@"targets"] ){
                NSString * targetString = target[@"string"];
                NSCAssert([target[@"score"] integerValue] ==
                           [source WSSLengthOfLongestCommonSubsequenceWithString:targetString],
                          @"");
//                NSCAssert([target[@"sequence"] isEqualToString:
//                           [source WSSLongestCommonSubsequenceWithString:targetString]],
//                          @"");
            }
        }


    }
    return 0;
}
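
The listing above leaves recovering the subsequence itself as a FIXME. One common way to do it is to walk the filled matrix back from the bottom-right cell. The following is a minimal sketch of that traceback in plain C, assuming ASCII input compared byte by byte. It is not a drop-in body for `WSSLongestCommonSubsequenceWithString:`, and the function name `lcs_string` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string holding one longest common subsequence
 * of source and target. The caller is responsible for freeing it. */
char *lcs_string(const char *source, const char *target)
{
    assert(source != NULL && target != NULL);
    size_t cols = strlen(source) + 1;
    size_t rows = strlen(target) + 1;
    size_t *scores = calloc(rows * cols, sizeof *scores);
    assert(scores != NULL);

    /* Fill the score matrix with the usual LCS recurrence. */
    for (size_t r = 1; r < rows; r++) {
        for (size_t c = 1; c < cols; c++) {
            if (target[r - 1] == source[c - 1]) {
                scores[r * cols + c] = scores[(r - 1) * cols + (c - 1)] + 1;
            } else {
                size_t up = scores[(r - 1) * cols + c];
                size_t left = scores[r * cols + (c - 1)];
                scores[r * cols + c] = up > left ? up : left;
            }
        }
    }

    size_t len = scores[(rows - 1) * cols + (cols - 1)];
    char *out = malloc(len + 1);
    assert(out != NULL);
    out[len] = '\0';

    /* Walk back from the bottom-right cell. Every diagonal step is a matched
     * character, so the output is written from its last position to its first. */
    size_t r = rows - 1, c = cols - 1, k = len;
    while (r > 0 && c > 0) {
        if (target[r - 1] == source[c - 1]) {
            out[--k] = source[c - 1];
            r--;
            c--;
        } else if (scores[(r - 1) * cols + c] >= scores[r * cols + (c - 1)]) {
            r--; /* the score came from the cell above */
        } else {
            c--; /* the score came from the cell to the left */
        }
    }

    free(scores);
    return out;
}
```

When several subsequences share the maximum length, this returns just one of them, determined by the tie-breaking rule in the traceback. For the test data above, `lcs_string("Ing", "Boylan")` yields "n" and `lcs_string("New Ing", "New Ingersoll")` yields "New Ing".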
