How is Levenshtein distance calculated on Simplified Chinese characters?

I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python Levenshtein library:

from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist

I get a result of 12. Now the question is, how is the value of 12 derived?

In terms of stroke differences between the characters, the distance would definitely be greater than 12.

1 answer

According to its documentation, it supports Unicode:

It supports both regular and Unicode strings, but cannot mix them, all arguments of a function (method) must be of the same type (or its subclasses).

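You need to make sure that the Chinese characters are passed as Unicode strings. Otherwise the distance is computed over the raw bytes, and each of these characters takes 3 bytes in UTF-8: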

In [1]: from Levenshtein import distance, hamming, median

In [2]: query1 = '你好世界'

In [3]: query2 = '你好'

In [4]: print distance(query1,query2)
6

In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
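
To see where the two numbers come from, here is a minimal sketch in Python 3 (an assumption, since the session above is Python 2). It uses a hand-written `levenshtein` helper, a hypothetical name and not the library's `distance` function, that runs the standard dynamic-programming edit distance over any sequence. Run on the text it compares characters; run on the UTF-8 encoding it compares bytes.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (illustrative helper, not the library API)."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


query1 = "你好世界"
query2 = "你好"

# Compared as text: two characters have to be deleted.
chars_distance = levenshtein(query1, query2)

# Compared as UTF-8 bytes: each of those characters is 3 bytes, so 6 deletions.
bytes_distance = levenshtein(query1.encode("utf-8"), query2.encode("utf-8"))

print(chars_distance)  # 2
print(bytes_distance)  # 6
```

In both cases the distance counts insertions, deletions and substitutions of sequence elements. Strokes are never taken into account.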