Wrong group values are plotted when calling plot() inside a data.table in RStudio

I want to create a split plot, with the values of group a in the top panel and the values of group b in the bottom panel. I am using data.table for this. Here is the code I used to generate the example data and set up the graphics output:

 library(data.table)
 set.seed(23)
 Example <- data.table('group' = rep(c('a', 'b'), each = 5), 'value' = runif(10))
 layout(1:2)
 par('mai' = rep(.5, 4))

When the following lines are run in a regular R console, the correct values are plotted. When the same code is run in RStudio, the values of the second group are used for both panels:

 Example[, plot(value, ylim = c(0, 1)), by = group] # Example 1
 Example[, .SD[plot(value, ylim = c(0, 1))], by = group] # Example 2

When a comma is added inside the .SD[] subset from Example 2, RStudio also produces the correct output:

 Example[, .SD[, plot(value, ylim = c(0, 1))], by = group] # Example 3 

When barplot() is used instead of plot(), RStudio also plots the correct values:

 Example[, barplot(value, ylim = c(0, 1)), by = group] # Example 4 

Am I missing something, or is this a bug?

System: Windows 7, RStudio Desktop v0.98.1091, R 3.1.2, data.table 1.9.4

1 answer

Good catch (+1'd already)! In my case, Example 3 does not produce the desired plot either (OS X 10.10.1, R 3.1.2, RStudio 0.98.1091).

The only relevant difference between the R console/GUI and RStudio is the graphics device. RStudio seems to use its own RStudioGD graphics device, whereas the R console/GUI uses Quartz.

By debugging graphics:::plot.default, I was able to narrow the problem down to the plot.xy() function. This function appears to be where the calls to the different graphics devices are made.

Initializing a different device first, for example Quartz by calling quartz(), and then running your code works fine!
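Since the problem seems tied to the RStudioGD device, the same idea can be tried with any other device. The sketch below is only an illustration under assumptions, not a verified fix: it uses the portable pdf() file device instead of quartz() (which exists only on OS X), and out_file is a made-up name for the output path.

```r
library(data.table)

set.seed(23)
Example <- data.table(group = rep(c("a", "b"), each = 5),
                      value = runif(10))

# Hypothetical output location; any writable path would do.
out_file <- tempfile(fileext = ".pdf")

# Open a file device explicitly so that plotting does not go through RStudioGD.
pdf(out_file)
layout(1:2)
par(mai = rep(0.5, 4))
# plot() returns NULL for each group, so the (empty) result is discarded.
invisible(Example[, plot(value, ylim = c(0, 1)), by = group])
invisible(dev.off())  # closing the device writes the file
```

On Windows, windows() plays the same role as quartz() if an on-screen device is preferred.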

FWIW, this problem can also be reproduced using dplyr:

 require(dplyr)
 df = as.data.frame(Example)
 my_fun = function(x) {plot(x, ylim=c(0,1)); 1L }
 df %>% group_by(group) %>% summarise(my_fun(value))

This produces the same incorrect plot.

This is most likely due to the way groups are handled in data.table (and I think dplyr handles them in a similar way), which you can see with:

 Example[, print(sapply(.SD, address)), by=group]
 #         value
 # "0x105bbf5b8"
 #         value
 # "0x105bbf5b8"
 # Empty data.table (0 rows) of 1 col: group

data.table allocates memory for .SD sized for the largest group and internally reuses that memory for every group, which avoids repeatedly allocating and deallocating memory. I am not sure (shooting in the dark here), but it seems that RStudioGD does not release the pointer to the group's data, so when the data for the next group is written into the same memory, the earlier plot is updated as well. You can verify this by doing:

 # on RStudioGD
 debug(graphics:::plot.default)
 set.seed(23)
 Example <- data.table('group' = rep(c('a', 'b'), each = 5), 'value' = runif(10))
 layout(1:2)
 par('mai' = rep(.5, 4))
 Example[, plot(value, ylim = c(0, 1)), by = group] # Example 1
 undebug(graphics:::plot.default)

Keep stepping through and you'll see that the first plot is drawn correctly, and that when the second plot is added, the first plot changes too. This may be related to the changes in R v3.1+, where function arguments are shallow-copied rather than deep-copied (again, shooting in the dark here).
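The memory reuse itself can also be checked without any plotting. This is a minimal sketch that collects the address of value for each group into a column; the names addr, where and shared are made up for illustration. It only confirms that the groups share one buffer, not what RStudioGD does with it.

```r
library(data.table)

set.seed(23)
Example <- data.table(group = rep(c("a", "b"), each = 5),
                      value = runif(10))

# Record where `value` lives in memory while each group is processed.
addr <- Example[, list(where = address(value)), by = group]
print(addr)

# TRUE here would mean both groups were served from the same buffer.
shared <- length(unique(addr$where)) == 1L
print(shared)
```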

For now, you can work around this by explicitly copying value:

 Example[, plot(copy(value), ylim = c(0, 1)), by = group] # Example 1 

This produces the correct plot.
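Assuming the buffer-reuse explanation above is right, the workaround helps because copy() hands plot() a vector in fresh memory that data.table does not overwrite when it moves on to the next group. A small sketch of what copy() does, with x and y as illustrative names:

```r
library(data.table)

x <- c(0.2, 0.4, 0.6)
y <- copy(x)  # deep copy: same values, separate memory

# The contents match ...
same_values <- identical(x, y)
# ... but the two vectors live at different addresses.
different_memory <- address(x) != address(y)

print(c(same_values = same_values, different_memory = different_memory))
```

The cost is one extra allocation per group, which is negligible for data of this size.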
