When trying to port some code from Matlab to R, I ran into a problem. The essence of the code is to estimate the density of the 2D core, and then perform some simple calculations using the estimate. In Matlab, KDE was calculated using the ksdensity2d.m function. In R, KDE calculation is done using kde2d from the MASS package. So let's say that I want to compute KDE and just add values โโ(this is not what I intend to do, but it serves this purpose). In R, this can be done with
library(MASS) set.seed(1009) x <- sample(seq(1000, 2000), 100, replace=TRUE) y <- sample(seq(-12, 12), 100, replace=TRUE) kk <- kde2d(x, y, h=c(30, 1.5), n=100, lims=c(1000, 2000, -12, 12)) sum(kk$z)
which gives the answer 0.3932732. When using ksdensity2d in Matlab using the same exact data and conditions, the answer is 0.3776. From a look at the code for kde2d, I noticed that the bandwidth is divided by 4
kde2d <- function (x, y, h, n = 25, lims = c(range(x), range(y))) { nx <- length(x) if (length(y) != nx) stop("data vectors must be the same length") if (any(!is.finite(x)) || any(!is.finite(y))) stop("missing or infinite values in the data are not allowed") if (any(!is.finite(lims))) stop("only finite values are allowed in 'lims'") n <- rep(n, length.out = 2L) gx <- seq.int(lims[1L], lims[2L], length.out = n[1L]) gy <- seq.int(lims[3L], lims[4L], length.out = n[2L]) h <- if (missing(h)) c(bandwidth.nrd(x), bandwidth.nrd(y)) else rep(h, length.out = 2L) if (any(h <= 0)) stop("bandwidths must be strictly positive") h <- h/4 ax <- outer(gx, x, "-")/h[1L] ay <- outer(gy, y, "-")/h[2L] z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay), , nx))/(nx * h[1L] * h[2L]) list(x = gx, y = gy, z = z) }
A simple check to determine if the difference in bandwidth is the reason for the difference in results, then
kk <- kde2d(x, y, h=c(30, 1.5)*4, n=100, lims=c(1000, 2000, -12, 12)) sum(kk$z)
which gives 0.3768013 (which matches Matlab's answer).
So my question is: why does kde2d divide the bandwidth into four? (Or why not ksdensity2d?)