Scatter plot kernel smoothing: ksmooth() does not smooth my data at all

Original question

I want to smooth out my explanatory variable, something like vehicle speed data, and then use these smoothed values. I searched a lot and did not find anything that would be a direct answer.

I know how to calculate the kernel density estimate ( density() or KernSmooth::bkde() ), but I don't know how to calculate smoothed speed values.


Re-edited Question

Thanks @ZheyuanLi, I can better explain what I have and what I want to do. So I re-edited my question as shown below.

I have measurements of vehicle speed over time, stored in a data frame called vehicle:

         t     speed
 1       0 0.0000000
 2       1 0.0000000
 3       2 0.0000000
 4       3 0.0000000
 5       4 0.0000000
 .       .         .
 .       .         .
 1031 1030 4.8772222
 1032 1031 4.4525000
 1033 1032 3.2261111
 1034 1033 1.8011111
 1035 1034 0.2997222
 1036 1035 0.2997222

Here is a scatter plot:

[Figure: scatter plot of speed against t]

I want to smooth speed against t, and I want to use kernel smoothing for this purpose. According to @Zheyuan's advice, I should use ksmooth():

 fit <- ksmooth(vehicle$t, vehicle$speed) 

However, I found that the smoothed values exactly match my original data:

 sum(abs(fit$y - vehicle$speed)) # 0 

Why is this happening? Thanks!

2 answers

Answer to the original question


You need to distinguish between "kernel density estimation" and "kernel smoothing."

Density estimation works with only one variable. It aims to estimate how that variable is distributed over its range. For example, if we have 1000 normal samples:

 x <- rnorm(1000, 0, 1) 

We can estimate its distribution using the kernel density estimate:

 k <- density(x)
 plot(k); rug(x)

[Figure: kernel density estimate of x with a rug of the sample values]

The rug marks on the x-axis show the locations of your x values, and the curve measures how densely those marks are packed.

Kernel smoothing is actually a regression problem, or a scatter plot smoothing problem. You need two variables: a response variable y and an explanatory variable x. Let us simply reuse x above as the explanatory variable. For the response variable y, we generate some toy values from

 y <- sin(x) + rnorm(1000, 0, 0.2) 

Given the scatter plot between y and x :

[Figure: scatter plot of y against x]

we want to find a smooth function to approximate these scattered points.

The Nadaraya-Watson kernel regression estimate, available in R as ksmooth(), will help you:

 s <- ksmooth(x, y, kernel = "normal")
 plot(x, y, main = "kernel smoother")
 lines(s, lwd = 2, col = 2)

[Figure: scatter plot of y against x with the kernel smoother overlaid]
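To make the estimator concrete, here is a minimal sketch that computes a Nadaraya-Watson estimate by hand and compares it with ksmooth(). The helper name nw_normal and the evaluation grid are my own choices for illustration. The conversion from bandwidth to the kernel's standard deviation follows the scaling documented for ksmooth() (kernel quartiles at ±0.25 * bandwidth). ksmooth() may treat the far kernel tails differently, so the two results should agree closely rather than exactly.

```r
set.seed(1)
x <- rnorm(1000, 0, 1)
y <- sin(x) + rnorm(1000, 0, 0.2)

# Hypothetical helper: Nadaraya-Watson estimate with a Gaussian kernel.
# ksmooth() documents kernel quartiles at +/- 0.25 * bandwidth, so the
# matching standard deviation is 0.25 * bandwidth / qnorm(0.75).
nw_normal <- function(x, y, x0, bandwidth) {
  s <- 0.25 * bandwidth / qnorm(0.75)
  sapply(x0, function(p) {
    w <- dnorm((x - p) / s)   # kernel weight of every observation
    sum(w * y) / sum(w)       # locally weighted average of y
  })
}

# Evaluate both versions on the same grid of points
grid <- seq(-2, 2, length.out = 50)
by_hand <- nw_normal(x, y, grid, bandwidth = 0.5)
by_ksmooth <- ksmooth(x, y, kernel = "normal", bandwidth = 0.5,
                      x.points = grid)$y

# Largest disagreement between the two curves
max_diff <- max(abs(by_hand - by_ksmooth))
print(max_diff)
```

Each fitted value is just a weighted average of the y values whose x lies near the evaluation point, which is why the method needs both a response and an explanatory variable.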

If you want to interpret everything in terms of prediction:

  • kernel density estimation: given x, predict the density of x; i.e. we have an estimate of the probability P(grid[n] < x < grid[n+1]), where grid are some support points;
  • kernel smoothing: given x, predict y; i.e. we have an estimate of the function f(x), which approximates y.

In both cases, you do not have the smoothed value of the explanatory variable x . Therefore, your question: “I want to smooth out my explanatory variable” does not make sense.
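The probability reading of the density estimate mentioned above can be checked numerically. This is a small sketch, assuming the default evenly spaced grid returned by density(); the variable names are mine.

```r
set.seed(1)
x <- rnorm(1000, 0, 1)
k <- density(x)

# density() evaluates the estimate on an evenly spaced grid k$x
step <- diff(k$x)[1]

# Approximate P(grid[n] < x < grid[n+1]) by density times cell width
cell_prob <- k$y * step

# The cell probabilities should add up to roughly 1
total <- sum(cell_prob)
print(total)
```

Note that the output describes x only: there is no response variable anywhere in the call.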


Do you have a time series?

"Vehicle speed" sounds like you are tracking speed over time t. If so, make a scatter plot of speed against t and use ksmooth().

Other smoothing approaches, such as loess() and smooth.spline(), are not kernel smoothers, but you can compare them.


Answer to the re-edited question

The default bandwidth for ksmooth() is 0.5:

 ksmooth(x, y, kernel = c("box", "normal"), bandwidth = 0.5,
         range.x = range(x), n.points = max(100L, length(x)), x.points)

For time series data sampled at unit intervals, this means that the window around t = i (which, for the default box kernel, has a total width equal to the bandwidth) contains no speed data other than speed[i]. As a result, no local weighted averaging takes place!

You need to choose a larger bandwidth. For example, with bandwidth = 10 the box window reaches 5 time units to each side of every point, so each fitted value averages the 11 nearest observations. This is what we get:

 fit <- ksmooth(vehicle$t, vehicle$speed, bandwidth = 10)
 plot(vehicle, cex = 0.5)
 lines(fit, col = 2, lwd = 2)

[Figure: vehicle data with the ksmooth() fit for bandwidth = 10]
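The effect of the window can be reproduced on made-up data. This sketch uses a synthetic series (not the vehicle data) sampled at t = 0, 1, 2, ... and relies on the scaling documented for ksmooth(), where the kernel quartiles sit at ±0.25 * bandwidth. For the box kernel that means a window of total width bandwidth centred on each evaluation point.

```r
set.seed(1)
t <- 0:199
speed <- 5 + 3 * sin(t / 20) + rnorm(200, 0, 0.5)   # synthetic stand-in data

# Default bandwidth = 0.5: the window around each t holds only that point
fit_default <- ksmooth(t, speed)
unchanged <- isTRUE(all.equal(fit_default$y, speed))

# bandwidth = 10: the box window reaches 5 time units to each side
fit_wide <- ksmooth(t, speed, bandwidth = 10)

# In the interior the fit is a plain moving average; check it at t = 50
i <- which(t == 50)
moving_avg <- mean(speed[(i - 5):(i + 5)])
matches <- isTRUE(all.equal(fit_wide$y[i], moving_avg))

print(c(unchanged = unchanged, matches = matches))
```

Both checks should come out TRUE: the default fit returns the data untouched, while the wider window gives an 11-point moving average away from the ends of the series.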

Smoothness selection

One problem with ksmooth() is that you have to set bandwidth yourself. You can see that this parameter greatly changes the fitted curve. A large bandwidth makes the curve smooth but far from the data, while a small bandwidth does the opposite.
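This trade-off can be put into numbers. The sketch below uses synthetic data and three arbitrarily chosen bandwidths. The two summary measures (residual sum of squares for closeness to the data, sum of squared first differences for wiggliness) are my own illustrative choices, not something ksmooth() reports.

```r
set.seed(1)
t <- 0:299
speed <- 5 + 3 * sin(t / 15) + rnorm(300, 0, 0.5)   # synthetic stand-in data

bandwidths <- c(3, 10, 100)
summary_tab <- do.call(rbind, lapply(bandwidths, function(bw) {
  fit <- ksmooth(t, speed, kernel = "normal", bandwidth = bw)
  c(bandwidth = bw,
    rss       = sum((speed - fit$y)^2),   # distance from the data
    roughness = sum(diff(fit$y)^2))       # wiggliness of the curve
}))
print(summary_tab)
```

As the bandwidth grows, the curve should drift away from the data (rss goes up) while becoming smoother (roughness goes down).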

Is there an optimal bandwidth ? Is there any way to choose the best?

Yes, use sm.regression() from the sm package, with a cross-validation method to select the bandwidth.

 fit <- sm.regression(vehicle$t, vehicle$speed, method = "cv",
                      eval.points = 0:1035)
 ## plot will be automatically generated!

[Figure: vehicle data with the sm.regression() fit using the cross-validated bandwidth]

You can check that fit$h is 18.7.
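To show what bandwidth selection by cross-validation involves, here is a hand-rolled leave-one-out sketch for a Gaussian-kernel smoother on synthetic data. It is not how sm.regression() is implemented. The function loo_cv, the candidate grid, and the data are all assumptions for illustration, and h here denotes the standard deviation of the kernel, which is not on the same scale as the bandwidth argument of ksmooth().

```r
set.seed(1)
t <- 0:199
speed <- 5 + 3 * sin(t / 20) + rnorm(200, 0, 0.5)   # synthetic stand-in data

# Hypothetical helper: leave-one-out CV score for kernel standard deviation h
loo_cv <- function(h, x, y) {
  w <- dnorm(outer(x, x, "-") / h)   # pairwise kernel weights
  diag(w) <- 0                       # leave each point out of its own fit
  pred <- as.vector(w %*% y) / rowSums(w)
  mean((y - pred)^2)                 # mean squared prediction error
}

# Score every candidate and keep the one with the smallest error
candidates <- seq(1, 30, by = 0.5)
scores <- sapply(candidates, loo_cv, x = t, y = speed)
best_h <- candidates[which.min(scores)]
print(best_h)
```

Very small h predicts each point from one or two noisy neighbours, and very large h averages away the signal, so the score is typically smallest somewhere in between.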

Another approach

Perhaps you think sm.regression() oversmooths your data? Well, use loess(), or my favorite: smooth.spline().

I had an answer:

Here I would demonstrate the use of smooth.spline() :

 fit <- smooth.spline(vehicle$t, vehicle$speed, all.knots = TRUE,
                      control.spar = list(low = -2, high = 2))
 # Call:
 # smooth.spline(x = vehicle$t, y = vehicle$speed, all.knots = TRUE,
 #     control.spar = list(low = -2, high = 2))
 # Smoothing Parameter  spar= 0.2519922  lambda= 4.379673e-11 (14 iterations)
 # Equivalent Degrees of Freedom (Df): 736.0882
 # Penalized Criterion: 3.356859
 # GCV: 0.03866391

 plot(vehicle, cex = 0.5)
 lines(fit$x, fit$y, col = 2, lwd = 2)

[Figure: vehicle data with the smooth.spline() fit using all knots]

Or use its regression spline version:

 fit <- smooth.spline(vehicle$t, vehicle$speed, nknots = 200)
 plot(vehicle, cex = 0.5)
 lines(fit$x, fit$y, col = 2, lwd = 2)

[Figure: vehicle data with the smooth.spline() fit using nknots = 200]
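For comparison, loess() can be used in much the same way. This is a minimal sketch on synthetic data: the data frame vehicle_demo is a made-up stand-in for the real data, and span = 0.2 is an arbitrary choice that you would need to tune.

```r
set.seed(1)
vehicle_demo <- data.frame(t = 0:299)   # synthetic stand-in data
vehicle_demo$speed <- 5 + 3 * sin(vehicle_demo$t / 15) + rnorm(300, 0, 0.5)

# span is the fraction of points used in each local fit
fit_lo <- loess(speed ~ t, data = vehicle_demo, span = 0.2)
smoothed <- predict(fit_lo)   # fitted values at the observed t

plot(vehicle_demo, cex = 0.5)
lines(vehicle_demo$t, smoothed, col = 2, lwd = 2)
```

As with bandwidth in ksmooth(), a larger span gives a smoother curve and a smaller span follows the data more closely.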

You really need to read my first link above to understand why I use control.spar in the first case, and without it in the second case.

More powerful package

I definitely recommend mgcv. I have several answers regarding mgcv, but I do not want to overwhelm you, so I will not elaborate here. Learn to make good use of ksmooth(), smooth.spline() and loess(). In the future, when you encounter a more complex problem, come back to Stack Overflow and ask for help!
