Difference in 2D KDE using kde2d (R) and ksdensity2d (Matlab)

Question

Difference in 2D KDE using kde2d (R) and ksdensity2d (Matlab)

When trying to port some code from Matlab to R, I ran into a problem. The essence of the code is to estimate the density of the 2D core, and then perform some simple calculations using the estimate. In Matlab, KDE was calculated using the ksdensity2d.m function. In R, KDE calculation is done using kde2d from the MASS package. So let's say that I want to compute KDE and just add values (this is not what I intend to do, but it serves this purpose). In R, this can be done with

library(MASS) set.seed(1009) x <- sample(seq(1000, 2000), 100, replace=TRUE) y <- sample(seq(-12, 12), 100, replace=TRUE) kk <- kde2d(x, y, h=c(30, 1.5), n=100, lims=c(1000, 2000, -12, 12)) sum(kk$z)

which gives the answer 0.3932732. When using ksdensity2d in Matlab using the same exact data and conditions, the answer is 0.3776. From a look at the code for kde2d, I noticed that the bandwidth is divided by 4

  kde2d <- function (x, y, h, n = 25, lims = c(range(x), range(y))) { nx <- length(x) if (length(y) != nx) stop("data vectors must be the same length") if (any(!is.finite(x)) || any(!is.finite(y))) stop("missing or infinite values in the data are not allowed") if (any(!is.finite(lims))) stop("only finite values are allowed in 'lims'") n <- rep(n, length.out = 2L) gx <- seq.int(lims[1L], lims[2L], length.out = n[1L]) gy <- seq.int(lims[3L], lims[4L], length.out = n[2L]) h <- if (missing(h)) c(bandwidth.nrd(x), bandwidth.nrd(y)) else rep(h, length.out = 2L) if (any(h <= 0)) stop("bandwidths must be strictly positive") h <- h/4 ax <- outer(gx, x, "-")/h[1L] ay <- outer(gy, y, "-")/h[2L] z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay), , nx))/(nx * h[1L] * h[2L]) list(x = gx, y = gy, z = z) }

A simple check to determine if the difference in bandwidth is the reason for the difference in results, then

  kk <- kde2d(x, y, h=c(30, 1.5)*4, n=100, lims=c(1000, 2000, -12, 12)) sum(kk$z)

which gives 0.3768013 (which matches Matlab's answer).

So my question is: why does kde2d divide the bandwidth into four? (Or why not ksdensity2d?)

+5

r matlab kernel-density

mkr Jun 03 '15 at 17:04

source share

1 answer

alexforrence · Accepted Answer · 2015-06-04T03:24:09+0000

In the mirrored github source, lines 31-35:

 if (any(h <= 0)) stop("bandwidths must be strictly positive") h <- h/4 # for S bandwidth scale ax <- outer(gx, x, "-" )/h[1L] ay <- outer(gy, y, "-" )/h[2L]

and the help file for kde2d () , which suggests looking at the help file for bandwidth . It says:

... which all scale to the density-width argument, and therefore give answers four times as many.

But why?

density () says that the width argument exists for compatibility with S (the predecessor of R). Comments in source for density() read:

 ## S has width equal to the length of the support of the kernel ## except for the gaussian where it is 4 * sd. ## R has bw a multiple of the sd.

The default is Gaussian. If bw not specified and width is, width is replaced, for example,

 library(MASS) set.seed(1) x <- rnorm(1000, 10, 2) all.equal(density(x, bw = 1), density(x, width = 4)) # Only the call is different

However, since kde2d() apparently written to remain compatible with S (and suppose it was originally written by FOR S, given it in MASS), everything ends with four. After going to the appropriate section of the MASS in the book (about p. 126), it seems that they may have chosen four to achieve a balance between the smoothness and accuracy of the data.

In conclusion, I assume that kde2d() divides by four to stay consistent with the rest of MASS (and other things originally written for S), and that the way you go about things looks fine.

Difference in 2D KDE using kde2d (R) and ksdensity2d (Matlab)

More articles: