In R ggplot2, include stat_ecdf() endpoints (0,0) and (1,1)

Question

I am trying to use stat_ecdf() to plot cumulative successes as a function of the rank score produced by a predictive model.

    # libraries
    require(ggplot2)
    require(scales)

    # fake data for reproducibility
    set.seed(123)
    n <- 200
    df <- data.frame(model_score = rexp(n = n, rate = 1:n),
                     obs_set = sample(c("training", "validation"), n, replace = TRUE))
    df$model_rank <- rank(df$model_score) / n
    df$target_outcome <- rbinom(n, 1, 1 - df$model_rank)

    # Plot Gain Chart using stat_ecdf()
    ggplot(subset(df, target_outcome == 1), aes(x = model_rank)) +
      stat_ecdf(aes(colour = obs_set), size = 1) +
      scale_x_continuous(limits = c(0, 1), labels = percent, breaks = seq(0, 1, .1)) +
      xlab("Model Percentile") +
      ylab("Percent of Target Outcome") +
      scale_y_continuous(limits = c(0, 1), labels = percent) +
      geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1),
                   colour = "gray", linetype = "longdash", size = 1) +
      ggtitle("Gain Chart")

All I want to do is make the ECDF start at (0,0) and end at (1,1), so that there are no gaps at the beginning or end of the curve. If possible, I would like to do this within the ggplot2 syntax, but I would settle for a clever workaround.

@Henrik: this is NOT a duplicate of that question, because I have already set my limits using scale_x_continuous() and scale_y_continuous(), and adding expand_limits() doesn't do anything. It is not the origin of the plot but the endpoints of stat_ecdf() that need to be fixed.

r ggplot2 ecdf

C8H10N4O2 Feb 19 '15 at 14:59

1 answer

user295691 · Accepted Answer · 2015-08-05T14:54:21+0000

Unfortunately, the definition of stat_ecdf leaves no wiggle room here; it determines the endpoints internally.

There is a somewhat advanced solution. With the latest version of ggplot2 ( devtools::install_github("hadley/ggplot2") ), the extensibility has improved to the point where you can override this behavior, but not without some boilerplate.

    stat_ecdf2 <- function(mapping = NULL, data = NULL, geom = "step",
                           position = "identity", n = NULL, show.legend = NA,
                           inherit.aes = TRUE, minval = NULL, maxval = NULL, ...) {
      layer(
        data = data,
        mapping = mapping,
        stat = StatEcdf2,
        geom = geom,
        position = position,
        show.legend = show.legend,
        inherit.aes = inherit.aes,
        stat_params = list(n = n, minval = minval, maxval = maxval),
        params = list(...)
      )
    }

    StatEcdf2 <- ggproto("StatEcdf2", StatEcdf,
      calculate = function(data, scales, n = NULL, minval = NULL, maxval = NULL, ...) {
        df <- StatEcdf$calculate(data, scales, n, ...)
        if (!is.null(minval)) {
          df$x[1] <- minval
        }
        if (!is.null(maxval)) {
          df$x[length(df$x)] <- maxval
        }
        df
      }
    )

Now stat_ecdf2 behaves the same as stat_ecdf, but with the additional parameters minval and maxval. So this will do the trick:

    ggplot(subset(df, target_outcome == 1), aes(x = model_rank)) +
      stat_ecdf2(aes(colour = obs_set), size = 1, minval = 0, maxval = 1) +
      scale_x_continuous(limits = c(0, 1), labels = percent, breaks = seq(0, 1, .1)) +
      xlab("Model Percentile") +
      ylab("Percent of Target Outcome") +
      scale_y_continuous(limits = c(0, 1), labels = percent) +
      geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1),
                   colour = "gray", linetype = "longdash", size = 1) +
      ggtitle("Gain Chart")

The big caveat here is that I don't know whether the current extensibility model will be supported in the future; it has changed several times in the past, and the switch to "ggproto" is recent, as in July 15, 2015.

On the plus side, this gave me a chance to really dig into the internals of ggplot, which I had been meaning to do for a while.
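For anyone wary of depending on ggproto internals, one possible workaround (a different technique from the stat override above) is to compute the ECDF points in base R and pin the endpoints by hand. The sketch below is only an illustration under stated assumptions: `ecdf_with_endpoints` is a hypothetical helper name, not part of ggplot2, and the `toy` data frame is made up to stand in for `subset(df, target_outcome == 1)`.

```r
# Hypothetical helper (not part of ggplot2): returns the step points of an
# ECDF with two extra rows so the curve starts at (minval, 0) and ends at
# (maxval, 1).
ecdf_with_endpoints <- function(x, minval = 0, maxval = 1) {
  xs <- sort(unique(x))
  data.frame(
    x    = c(minval, xs, maxval),
    ecdf = c(0, stats::ecdf(x)(xs), 1)
  )
}

# Made-up data standing in for the successes in the question's df.
toy <- data.frame(
  model_rank = c(0.2, 0.4, 0.4, 0.9, 0.1, 0.5),
  obs_set    = c("training", "training", "training", "training",
                 "validation", "validation")
)

# Compute one padded ECDF per obs_set group and stack the results.
curves <- do.call(rbind, lapply(split(toy, toy$obs_set), function(g) {
  data.frame(obs_set = g$obs_set[1], ecdf_with_endpoints(g$model_rank))
}))
rownames(curves) <- NULL

print(curves)
```

The resulting data frame could then be drawn with something like `ggplot(curves, aes(x = x, y = ecdf, colour = obs_set)) + geom_step()`, keeping the scales and labels from the original plot. Because the endpoints are ordinary rows in the data, this does not rely on how stat_ecdf computes its values internally.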
