Wrong group values are used when using plot() in data.table in RStudio

Question


I want to create a split chart, with the values of group a in the top panel and the values of group b in the bottom panel. I am using data.table for this. Here is the code I used to generate the example data and configure the graphics output:

    library(data.table)
    set.seed(23)
    Example <- data.table('group' = rep(c('a', 'b'), each = 5), 'value' = runif(10))
    layout(1:2)
    par('mai' = rep(.5, 4))

When the following lines are run in a regular R console, the correct values are used for each plot. When the same code is run in RStudio, the values of the second group are used for both plots:

    Example[, plot(value, ylim = c(0, 1)), by = group] # Example 1
    Example[, .SD[plot(value, ylim = c(0, 1))], by = group] # Example 2

When a comma is added inside the .SD[] subset of Example 2, RStudio also generates the correct output:

 Example[, .SD[, plot(value, ylim = c(0, 1))], by = group] # Example 3

When barplot() is used instead of plot(), RStudio also uses the correct values:

 Example[, barplot(value, ylim = c(0, 1)), by = group] # Example 4

Am I missing something, or is this a bug?

System: Windows 7, RStudio Desktop v0.98.1091, R 3.1.2, data.table 1.9.4


r rstudio data.table

Jonas Dec 16 '14 at 13:29


1 answer

Arun · Accepted Answer · 2014-12-16T16:18:32+0000

Good catch (+1'd already)! In my case, Example 3 does not produce the desired plot either (OS X 10.10.1, R 3.1.2, RStudio 0.98.1091).

The only difference between the R console/GUI and RStudio seems to be the graphics device. RStudio seems to use its own RStudioGD graphics device, whereas Quartz is used in the R console/GUI.

By debugging graphics:::plot.default, I was able to narrow the problem down to the plot.xy() function. This function hands off to whichever graphics device is in use (as described above).

Initializing Quartz explicitly, for example by calling the quartz() function, and then running your code works fine!

FWIW, this problem can also be reproduced using dplyr:

    require(dplyr)
    df = as.data.frame(Example)
    my_fun = function(x) {plot(x, ylim=c(0,1)); 1L }
    df %>% group_by(group) %>% summarise(my_fun(value))

produces the same incorrect plot.

This is most likely due to the way subgroups are handled in data.table (and I think dplyr does something similar to data.table), which you can see by:

    Example[, print(sapply(.SD, address)), by=group]
    #         value
    # "0x105bbf5b8"
    #         value
    # "0x105bbf5b8"
    # Empty data.table (0 rows) of 1 col: group

data.table allocates .SD once, sized for the largest group, and internally reuses that memory for each subgroup, which avoids repeatedly allocating and deallocating memory and is therefore more efficient. Not sure (shooting in the dark here), but it seems that RStudioGD does not release the pointer associated with the subgroup, so as the data in the subgroup is updated, the plot is updated as well. You can verify this by doing:

    # on RStudioGD
    debug(graphics:::plot.default)
    set.seed(23)
    Example <- data.table('group' = rep(c('a', 'b'), each = 5), 'value' = runif(10))
    layout(1:2)
    par('mai' = rep(.5, 4))
    Example[, plot(value, ylim = c(0, 1)), by = group] # Example 1
    undebug(graphics:::plot.default)

Keep stepping through and you'll see that the first plot is drawn correctly, but when the second plot is added, the first plot changes. This may be due to recent changes in R v3.1+, where function arguments are shallow copied rather than deep copied (again, shooting in the dark here).
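The buffer reuse can also be looked at without any graphics device. The following is only a rough sketch: kept_raw and kept_copy are names made up for this illustration, and whether the uncopied references end up identical may depend on the data.table version, so the last line just reports what happens instead of assuming a result. Only the copied values are guaranteed to match the original groups.

```r
library(data.table)

set.seed(23)
Example <- data.table(group = rep(c("a", "b"), each = 5), value = runif(10))

kept_raw  <- list()  # hypothetical: references to `value` as seen inside j
kept_copy <- list()  # hypothetical: explicit copies taken with copy()

invisible(Example[, {
  kept_raw[[group]]  <<- value        # stores a reference, no copy is made
  kept_copy[[group]] <<- copy(value)  # forces a fresh allocation per group
  NULL
}, by = group])

# The explicit copies always hold the right values for each group
identical(kept_copy[["a"]], Example$value[1:5])
identical(kept_copy[["b"]], Example$value[6:10])

# If the group buffer was reused in place, both uncopied entries will
# show the values of the last group processed
identical(kept_raw[["a"]], kept_raw[["b"]])
```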

You can temporarily fix this by explicitly copying value :

 Example[, plot(copy(value), ylim = c(0, 1)), by = group] # Example 1

produces the correct plot.
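Another option, not part of the original answer and only a sketch, is to split the values by group before plotting, so that each call to plot() receives its own freshly allocated vector instead of the buffer data.table reuses across groups. The name values_by_group and the main = g panel title are additions made for this illustration.

```r
library(data.table)

set.seed(23)
Example <- data.table(group = rep(c("a", "b"), each = 5), value = runif(10))

# split() returns one newly allocated vector per group
values_by_group <- split(Example$value, Example$group)

layout(1:2)
par(mai = rep(.5, 4))
for (g in names(values_by_group)) {
  # each panel gets its own vector, labelled with the group name
  plot(values_by_group[[g]], ylim = c(0, 1), main = g)
}
```

This gives up the by = group syntax, but it does not depend on how the graphics device holds on to its input.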
