How to access a data.frame variable in a dplyr pipeline programmatically?

Question

    library(ggplot2)
    library(dplyr)
    library(scales)

    data <- data.frame(THEME_NAME = c(rep("A", 10), rep("B", 20), rep("C", 15)))

    data %>%
      group_by(THEME_NAME) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n)) %>%
      # THE NEXT LINE !!! #
      ggplot(., aes(x = reorder(THEME_NAME, desc(freq)), y = freq)) +
      geom_bar(stat="identity") +
      scale_y_continuous(labels=percent)

How can I refer to THEME_NAME programmatically? I can use .$THEME_NAME, but I would like to call something like .[1] or select(., 1) instead.

The reason is that I would like to use this pipeline in a wider context, for example by passing a set of factor variables through it. Something like vars.to.plot <- sapply(data, is.factor), and then running each vars.to.plot element through this pipeline.

+5

r ggplot2 dplyr

Jasonaizkalns Feb 13 '15 at 18:41


3 answers

You need to keep the name of the grouping variable in a variable of your own, because the information about the "group by" variable is not stored in the tbl_df object after calling summarise(). You can do it like this:

    varname <- "THEME_NAME"

    data %>%
      group_by_(varname) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n)) %>%
      ggplot(eval(bquote(aes(x=reorder(.(as.name(varname)), desc(freq)), y=freq)))) +
      geom_bar(stat="identity") +
      scale_y_continuous(labels=percent)

This uses bquote() to build the aes() call dynamically. It is only necessary because of the reorder() step you want; otherwise aes_string() or something similar would be much easier.
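
To see what bquote() contributes, here is a minimal sketch in base R, with no ggplot2 or dplyr needed. The expression is only built, never evaluated, so reorder, desc and freq are just symbols inside it:

```r
# Column name held as a string, as in the answer above.
varname <- "THEME_NAME"

# bquote() quotes its argument but evaluates anything wrapped in .( ),
# so the string is converted to a symbol and spliced into the call.
expr <- bquote(reorder(.(as.name(varname)), desc(freq)))

print(expr)         # the unevaluated call with THEME_NAME spliced in
print(class(expr))  # an ordinary "call" object; eval() is what runs it later
```

Wrapping the same construction in aes() and eval() gives the mapping that ggplot() receives in the pipeline.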

If you always want to reorder based on the first column (which assumes you never group by more than one variable), you could do

    data %>%
      group_by(THEME_NAME) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n)) %>%
      {
        ggplot(., eval(substitute(
          aes(x=reorder(X, desc(freq)), y=freq),
          list(X=as.name(names(.)[1]))
        ))) +
          geom_bar(stat="identity") +
          scale_y_continuous(labels=percent)
      }

which does not require a separate variable for the column name.
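
The substitute() step can also be tried on its own in base R. In this sketch, df is a hypothetical stand-in for the summarised table; only its column names matter, and the expression is built without being evaluated:

```r
# Hypothetical stand-in for the summarised table.
df <- data.frame(THEME_NAME = c("A", "B"), n = c(10, 20), freq = c(1/3, 2/3))

# names(df)[1] is the first column's name as a string; as.name() makes it a symbol.
first_col <- as.name(names(df)[1])

# substitute() replaces the placeholder X in the quoted expression.
expr <- substitute(reorder(X, desc(freq)), list(X = first_col))

print(expr)
```

Inside the braces of the pipeline, the dot plays the role of df here, which is why names(.)[1] picks up the grouping column.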

+3

Mrflick Feb 13 '15 at 20:33


As far as I can tell, this has to be done in three parts. I found a few limitations along the way, and I would appreciate a correction if I am wrong about them.

    data <- data.frame(THEME_NAME = c(rep("A", 10), rep("B", 20), rep("C", 15)))
    my_var <- names(data)[1]

    df <- data %>%
      group_by_(my_var) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n)) %>%
      arrange(desc(freq))

    df[[1]] <- factor(df[[1]], levels = unique(df[[1]]))

    ggplot(df, aes_string(x = my_var, y = "freq")) +
      geom_bar(stat="identity") +
      scale_y_continuous(labels=percent)

While getting this to work, I ran into these problems:

It is not possible to stop ggplot from ordering the x axis automatically without resetting the variable's levels before the call. The only way to do it inside the ggplot call is with reorder, which, as far as I know, cannot be used with aes_string.
Another idea I had was to use mutate to reset the levels. You need the s_mutate function from dplyrExtras to use strings, but resetting the levels inside the pipe does not work with strings.

With plain mutate the statement looks like this (which, by the way, works):

 mutate(THEME_NAME = factor(THEME_NAME, levels=unique(THEME_NAME)))

but with the string-accepting version, the levels remain unchanged:

 s_mutate(my_var = factor(my_var, levels = unique(my_var)))
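
Both problems may have a workaround in later package releases. The following is only a sketch, assuming dplyr 1.0 or newer and ggplot2 3.0 or newer, neither of which existed when this answer was written. across(all_of(my_var)) lets group_by() and mutate() target a column named by a string, and the .data pronoun lets reorder() be used inside aes() with a string column name:

```r
library(dplyr)
library(ggplot2)
library(scales)

data <- data.frame(THEME_NAME = c(rep("A", 10), rep("B", 20), rep("C", 15)))
my_var <- names(data)[1]

df <- data %>%
  group_by(across(all_of(my_var))) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(freq)) %>%
  # Reset the levels of the column named in my_var to the current row order.
  mutate(across(all_of(my_var), ~ factor(.x, levels = unique(.x))))

# Alternative to resetting levels: reorder inside aes() via the .data pronoun.
p <- ggplot(df, aes(x = reorder(.data[[my_var]], desc(freq)), y = freq)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = percent) +
  xlab(my_var)
```

Either step alone is enough to fix the bar order; both are shown only to address the two problems separately. xlab(my_var) is there because the axis title would otherwise show the reorder() expression.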

+1

cdeterman Feb 13 '15 at 19:53


Jasonaizkalns · Accepted Answer · 2015-02-13T22:01:21+0000

The ideas presented here are useful, but this is what I actually ended up with:

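    library(ggplot2)
    library(dplyr)
    library(scales)

    # stringsAsFactors = TRUE is needed on R >= 4.0, where strings are no
    # longer converted to factors by default.
    data <- data.frame(THEME_NAME = c(rep("A", 10), rep("B", 20), rep("C", 15)),
                       THEME_NAME_2 = c(rep("E", 5), rep("F", 40)),
                       Non_Factor = 1:45,
                       stringsAsFactors = TRUE)

    factor.vars <- sapply(data, is.factor)
    varnames <- names(data)[factor.vars]

    # Order factor levels by descending frequency.
    myReorder <- function(x) {
      factor(x, levels=names(sort(table(x), decreasing=TRUE)))
    }

    for (i in seq_along(varnames)) {
      data[, varnames[i]] <- myReorder(data[, varnames[i]])
    }

    # One bar chart of relative frequencies per factor variable.
    # geom_bar() replaces the original geom_histogram(), which recent
    # ggplot2 versions reject for a discrete x variable.
    for (i in seq_along(varnames)) {
      print(ggplot(data, aes_string(x = varnames[i], y = "..count../sum(..count..)")) +
              geom_bar())
    }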
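
The myReorder helper does the real work here, and it can be checked in isolation with base R. A small sketch, where the vector x is made up for illustration:

```r
# Same helper as in the answer: order factor levels by descending frequency.
myReorder <- function(x) {
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}

# Made-up example vector: "C" occurs three times, "B" twice, "A" once.
x <- c("A", "B", "B", "C", "C", "C")

default_levels   <- levels(factor(x))     # alphabetical order
reordered_levels <- levels(myReorder(x))  # most frequent level first

print(default_levels)
print(reordered_levels)
```

Because ggplot2 draws a discrete axis in level order, resetting the levels this way before plotting is what puts the tallest bar first.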
