Canberra Distance - Inconsistent Results

Question


I am trying to understand what is happening with my calculation of the Canberra distance. I wrote my own simple canberra.distance function, but its results are not consistent with those of the dist function. I added the na.rm = T argument to sum() in my function so that terms with a zero denominator (0/0 gives NaN) are dropped from the sum. From ?dist I understand that dist takes a similar approach: "Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing."

    canberra.distance <- function(a, b){
      sum( (abs(a - b)) / (abs(a) + abs(b)), na.rm = T )
    }

    a <- c(0, 1, 0, 0, 1)
    b <- c(1, 0, 1, 0, 1)
    canberra.distance(a, b)
    #> 3 (the result that I expected)
    dist(rbind(a, b), method = "canberra")
    #> 3.75

    a <- c(0, 1, 0, 0)
    b <- c(1, 0, 1, 0)
    canberra.distance(a, b)
    #> 3 (the result that I expected)
    dist(rbind(a, b), method = "canberra")
    #> 4

    a <- c(0, 1, 0)
    b <- c(1, 0, 1)
    canberra.distance(a, b)
    #> 3
    dist(rbind(a, b), method = "canberra")
    #> 3 (now the results are the same)

The pairs 0-0 and 1-1 seem to be the problematic ones. In the first case (0-0), both the numerator and the denominator are zero, so this pair should be omitted. In the second case (1-1), the numerator is 0 but the denominator is not, so the term is 0 and the sum should not change.

What am I missing here?

EDIT: To match the definition that R uses, the canberra.distance function can be modified as follows:

    canberra.distance <- function(a, b){
      sum( abs(a - b) / abs(a + b), na.rm = T )
    }

However, the results are the same as before.


r distance

Adela Aug 11 '16 at 11:07


1 answer

ekstroem · Answer 1 · 2016-08-11T12:51:15+0000

This may shed some light on the difference. As far as I can see, this is the actual C code that is executed to calculate the distance:

    static double R_canberra(double *x, int nr, int nc, int i1, int i2)
    {
        double dev, dist, sum, diff;
        int count, j;

        count = 0;
        dist = 0;
        for(j = 0 ; j < nc ; j++) {
            if(both_non_NA(x[i1], x[i2])) {
                sum = fabs(x[i1] + x[i2]);
                diff = fabs(x[i1] - x[i2]);
                if (sum > DBL_MIN || diff > DBL_MIN) {
                    dev = diff/sum;
                    if(!ISNAN(dev) ||
                       (!R_FINITE(diff) && diff == sum &&
                        /* use Inf = lim x -> oo */ (int) (dev = 1.))) {
                        dist += dev;
                        count++;
                    }
                }
            }
            i1 += nr;
            i2 += nr;
        }
        if(count == 0) return NA_REAL;
        if(count != nc) dist /= ((double)count/nc);
        return dist;
    }

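One candidate for the culprit is this line

    if(!ISNAN(dev) ||
       (!R_FINITE(diff) && diff == sum &&
        /* use Inf = lim x -> oo */ (int) (dev = 1.)))

which handles a special case (an infinite difference equal to the sum, where the term is set to 1) and may not be documented. That case only applies to infinite values, though. For the finite values in the question, the rescaling at the end of the function appears to be what changes the result: whenever terms are skipped, count != nc and the sum is divided by count/nc, so 3 becomes 3 / (4/5) = 3.75 in the first example and 3 / (3/4) = 4 in the second.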
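As a rough check of how the quoted function treats the three examples from the question, here is a minimal standalone C sketch of the same loop. The name canberra_like and its two-vector signature are hypothetical (the real R_canberra indexes rows of R's column-major matrix), and isnan(), isfinite() and NAN are used as assumed stand-ins for R's ISNAN, R_FINITE and NA_REAL.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical re-statement of the loop in the quoted R_canberra,
   taking two plain vectors of length n instead of matrix rows. */
double canberra_like(const double *a, const double *b, int n)
{
    assert(n >= 0);
    double dist = 0.0;
    int count = 0;                      /* number of terms actually summed */

    for (int j = 0; j < n; j++) {
        if (isnan(a[j]) || isnan(b[j]))
            continue;                   /* stand-in for both_non_NA() */
        double sum  = fabs(a[j] + b[j]);
        double diff = fabs(a[j] - b[j]);
        if (sum > DBL_MIN || diff > DBL_MIN) {   /* a 0-0 pair is skipped here */
            double dev = diff / sum;
            if (isnan(dev) && !isfinite(diff) && diff == sum)
                dev = 1.0;              /* Inf/Inf is treated as 1 */
            if (!isnan(dev)) {
                dist += dev;
                count++;
            }
        }
    }
    if (count == 0)
        return NAN;                     /* stand-in for NA_REAL */
    if (count != n)
        dist /= ((double)count / n);    /* scale the sum up for omitted terms */
    return dist;
}
```

With the vectors from the question, this sketch returns 3.75 for the length-5 pair (4 of 5 terms used), 4 for the length-4 pair (3 of 4 terms used) and 3 for the length-3 pair (no term skipped), which matches the dist() output shown there. A 1-1 pair contributes a term of 0 and is counted, so it does not trigger the scaling; only the 0-0 pair does.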
