Plotting inside a function: subset(df, id_ == ...) gives the wrong plot, df[df$id_ == ..., ] is right

Question


I have a df with several y-series that I want to plot individually, so I wrote a fn that selects one particular series, assigns it to a local variable dat, and then plots it. However, ggplot/geom_step, when called inside the fn, does not treat it properly as a single series. I don't see how this can be a scoping issue: if dat weren't visible, surely ggplot would fail?

You can verify that the code is correct when executed from the top-level environment, but not inside the function. This is not a duplicate question. I understand the problem (this is a recurring issue with ggplot), but I have read through all the other answers and they do not provide a solution.

    set.seed(1234)
    require(ggplot2)
    require(scales)

    N = 10
    df <- data.frame(x = 1:N,
                     id_ = c(rep(20,N), rep(25,N), rep(33,N)),
                     y = c(runif(N, 1.2e6, 2.9e6), runif(N, 5.8e5, 8.9e5), runif(N, 2.4e5, 3.3e5)),
                     row.names=NULL)

    plot_series <- function(id_, envir=environment()) {
      dat <- subset(df,id_==id_)
      p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
      # Unsuccessfully trying the approach from http://stackoverflow.com/questions/22287498/scoping-of-variables-in-aes-inside-a-function-in-ggplot
      p$plot_env <- envir
      plot(p)
      # Displays wrongly whether we do the plot here inside fn, or return the object to parent environment
      return(p)
    }

    # BAD: doesn't plot geom_step!
    plot_series(20)

    # GOOD! but what's causing the difference?
    ggplot(data=subset(df,id_==20), mapping=aes(x,y), color='red') + geom_step()

    #plot_series(25)
    #plot_series(33)

+1

function r indexing evaluation subset

smci May 05 '14 at 21:15


2 answers

Despite the comments, this works:

    plot_series <- function(z, envir=environment()) {
      # The argument is named z so it no longer clashes with the id_ column
      dat <- subset(df,id_==z)
      p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
      p$plot_env <- envir
      plot(p)
      return(p)
    }

    plot_series(20)

The problem is that subset interprets the id_ on the RHS of == as identical to the one on the LHS, so this is equivalent to subsetting on T, which of course includes all the rows of df. That is the plot you are seeing.
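The effect can be checked without plotting anything. The following is a minimal sketch in base R; the data frame is a small stand-in for the question's df, and the helper names count_same_name and count_other_name are made up for illustration.

```r
# Small stand-in for the question's df: three series of 10 rows each.
N <- 10
df <- data.frame(x = rep(1:N, 3),
                 id_ = rep(c(20, 25, 33), each = N),
                 y = seq_len(3 * N))

# Hypothetical helpers that return row counts instead of plotting.
# Argument shares the column's name: the column is compared with itself.
count_same_name <- function(id_) nrow(subset(df, id_ == id_))
# Argument has a different name: the column is compared with the argument.
count_other_name <- function(z) nrow(subset(df, id_ == z))

n_same <- count_same_name(20)    # every row is kept
n_other <- count_other_name(20)  # only the rows of series 20 are kept
```

Comparing n_same with n_other shows that the first version never filters, which is why geom_step draws all three series as one.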

+3

jlhoward May 05, '14 at 21:43


joran · Accepted Answer · 2014-05-05T21:46:09+0000

This works great:

    plot_series <- function(id_) {
      dat <- df[df$id_ == id_,]
      p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
      return(p)
    }

    print(plot_series(20))

If you simply step through your original function using debug, you will quickly see that the subset line did not actually subset the data frame at all: it returned all rows!

Why? Because subset uses non-standard evaluation, and you used the same name for both a column and a function argument. As jlhoward demonstrates above, it would have worked (though it is probably not advisable) to merely use different names for the two.

The reason is that subset evaluates the condition within the data frame first. So all it sees in the logical expression is the always-true id_ == id_ within that data frame.

One way to think about it is to play dumb (like a computer) and ask yourself, when presented with the condition id_ == id_, how you would know what precisely each symbol refers to. It is ambiguous, and subset makes a consistent choice: use what is in the data frame.
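That lookup order can be sketched with a simplified stand-in for subset. The function my_subset below is hypothetical and only mirrors the general idea (capture the condition with substitute, then eval it with the data frame searched before the caller's environment); it is not the real implementation.

```r
df <- data.frame(id_ = c(20, 20, 25, 33), y = 1:4)

# Hypothetical, simplified stand-in for subset(): capture the unevaluated
# condition, then evaluate it with the data frame's columns searched first
# and the caller's environment used only as a fallback.
my_subset <- function(data, cond) {
  expr <- substitute(cond)
  keep <- eval(expr, data, parent.frame())
  data[keep & !is.na(keep), , drop = FALSE]
}

# Both occurrences of id_ resolve to the column, so nothing is filtered.
f_shadowed <- function(id_) my_subset(df, id_ == id_)
# z is not a column, so the lookup falls through to the function argument.
f_distinct <- function(z) my_subset(df, id_ == z)

rows_shadowed <- nrow(f_shadowed(20))
rows_distinct <- nrow(f_distinct(20))
```

Standard indexing such as df[df$id_ == id_, ] avoids the ambiguity because df$id_ can only mean the column and the bare id_ can only mean the argument.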

