Levenshtein Algorithm with Numeric Vectors

Question


I have two vectors with numeric values. For instance,

    v1 <- c(1, 3, 4, 5, 6, 7, 8)
    v2 <- c(54, 23, 12, 53, 7, 8)

I would like to calculate the number of insertions, deletions and substitutions needed to turn one vector into the other, with costs c1, c2 and c3 for the respective operations. I know that the adist function in base R does this for strings, but I don't know of an equivalent function for numbers.

I thought about mapping each number to a letter, but I have more than 2000 unique numbers, so if anyone knows how to get 2000 different characters in R, that would also be a solution for me.

Thank you for your help.


r levenshtein distance

Osbi May 15, '14 at 9:17


1 answer

gagolews · Accepted Answer · 2014-05-15T10:00:32+0000

An integer vector can be thought of as a single string encoded in UTF-32 (in which each Unicode code point is represented as one 32-bit integer). You can get an "ordinary" string by converting such a vector to UTF-8 with intToUtf8.

    intToUtf8(c(65, 97))
    ## [1] "Aa"

By the way, adist applies utf8ToInt (the reverse operation) to its inputs by default, so internally it computes its results on integer vectors anyway. This is not much of a hack.

Here is the solution:

    adist(intToUtf8(c(1, 3, 4, 5, 6, 7, 8)), intToUtf8(c(54, 23, 12, 53, 7, 8)), counts=TRUE)
    ##      [,1]
    ## [1,]    5
    ## attr(,"counts")
    ## , , ins
    ##
    ##      [,1]
    ## [1,]    0
    ##
    ## , , del
    ##
    ##      [,1]
    ## [1,]    1
    ##
    ## , , sub
    ##
    ##      [,1]
    ## [1,]    4
    ##
    ## attr(,"trafos")
    ##      [,1]
    ## [1,] "SSSSDMM"

The code above should work as long as all numbers are strictly greater than 0. R treats Unicode code points quite liberally (too liberally, in fact, but here that works in your favor); even the largest possible integer is accepted:

    utf8ToInt(intToUtf8(c(2147483647)))
    ## 2147483647

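If a vector contains zeros or negative values, you can transform it first, for example with x <- x-min(x)+1. Apply the same shift to both vectors (that is, take the minimum over both), so that equal numbers still map to equal code points.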
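A more general transformation can be sketched as follows, assuming the values may be negative or non-integer: each distinct value is mapped to its position in the sorted set of all values that occur in either vector. The example vectors and the names `codes`, `s1` and `s2` are illustrative and not part of the original answer. Because the codes are just 1, 2, 3, ..., this does not depend on how a given R version treats unusually large code points, provided the number of distinct values stays modest (a few thousand is fine).

```r
# Hypothetical example vectors with negative and non-integer values
v1 <- c(-2.5, 0, 10, 7, 8)
v2 <- c(99, 0, 7, 8)

# Shared lookup table: every distinct value gets a code 1, 2, 3, ...
codes <- sort(unique(c(v1, v2)))

# Encode each vector as a string of code points via the shared table
s1 <- intToUtf8(match(v1, codes))
s2 <- intToUtf8(match(v2, codes))

d <- adist(s1, s2, counts = TRUE)
d[1, 1]            # edit distance between v1 and v2
attr(d, "counts")  # numbers of insertions, deletions and substitutions
```

Using one lookup table for both vectors is what keeps equal numbers equal after encoding; building a separate table per vector would silently change the result.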

If you need different costs for insertions, deletions and substitutions, see the costs argument of adist. There is also a package called stringdist that provides many other string metrics. The scheme above should work there as well.
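As a minimal sketch of the costs argument, using the vectors from the question: the cost values below are hypothetical placeholders for c1, c2 and c3, and `op_costs` is an illustrative name. The costs are passed as a named vector whose names match insertions, deletions and substitutions.

```r
v1 <- c(1, 3, 4, 5, 6, 7, 8)
v2 <- c(54, 23, 12, 53, 7, 8)

# Hypothetical costs: one substitution costs more than
# an insertion plus a deletion
op_costs <- c(insertions = 1, deletions = 1, substitutions = 3)

d <- adist(intToUtf8(v1), intToUtf8(v2), costs = op_costs, counts = TRUE)
d[1, 1]            # weighted edit distance
attr(d, "counts")  # operations used by the cheapest transformation
```

With costs like these, the cheapest transformation no longer relies on substitutions, so the operation counts differ from the default unit-cost result shown earlier.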
