A function similar to group_by when groups are not mutually exclusive

Question

A function similar to group_by when groups are not mutually exclusive

I would like to create a function in R similar to the dplyr group_by function, which in combination with summarise can provide summary statistics for a dataset where group membership is not mutually exclusive. I., observations can belong to several groups. One way to think about it is to consider tags; observations can belong to one or more tags that may overlap.

For example, take the R esoph dataset ( https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html ), documenting a case study of esophageal cancer. Suppose I’m interested in the number and proportion of cancer cases in general and by the “tag”, where the tags are: over 65; 80+ g / day of alcohol; 20+ g / day of tobacco; and a high-risk group in which the previous 3 criteria are met. Convert the data set to a long format (one member per line), and then add these tags (logical columns) to the data set:

 library('dplyr') data(esoph) esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1), esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0) ) %>% mutate(highage=(agegp %in% c('65-74','75+')), highalc=(alcgp %in% c('80-119','120+')), hightob=(tobgp %in% c('20-29','30+')), highrisk=(highage & highalc & hightob) )

My usual approach is to create a dataset where each observation is duplicated for each of its tags, and then summarise this dataset:

 esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'), esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'), esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'), esophlong %>% filter(highrisk) %>% mutate(tag='high risk'), esophlong %>% filter() %>% mutate(tag='all') ) %>% mutate(tag=factor(tag, levels = unique(.$tag))) summary = esophdup %>% group_by(tag) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case))

This approach is inefficient for large datasets or for a large number of tags, and I often will have insufficient memory to store it.

An alternative is to summarise each tag separately, and then bind these pivot data arrays as follows:

 summary.age = esophlong %>% filter(highage) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='age>=65') summary.alc = esophlong %>% filter(highalc) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='alc>=80') summary.tob = esophlong %>% filter(hightob) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='tob>=20') summary.highrisk = esophlong %>% filter(highrisk) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='high risk') summary.all = esophlong %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='all') summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)

This approach is time consuming and tedious when I have a large number of tags, or I want to reuse tags often for different total metrics throughout the project.

The function that I mean, say group_by_tags(data, key, ...) , which includes an argument to indicate the name of the grouping column, should work something like this:

 summary = esophlong %>% group_by_tags(key='tags', 'age>=65'=highage, 'alc>=80'=highalc, 'tob>=20'=hightob, 'high risk'=highrisk, 'all ages'=1 ) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case))

where the composite dataset is as follows:

 > summary tags n ncases case.rate 1 age>=65 273 68 0.2490842 2 alc>=80 301 96 0.3189369 3 tob>=20 278 64 0.2302158 4 high risk 11 5 0.4545455 5 all 1175 200 0.1702128

Moreover, it can take variables of the type “factor”, as well as “logical”, so that it can summarize, say, each age group individually, over 65 years old and all:

 summaryage = esophlong %>% group_by_tags(key='Age.group', agegp, '65+'=(agegp %in% c('65-74','75+')), 'all'=1 ) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) >summaryage Age.group n ncases case.rate 1 25-34 117 1 0.0085470 2 35-44 208 9 0.0432692 3 45-54 259 46 0.1776062 4 55-64 318 76 0.2389937 5 65-74 216 55 0.2546296 6 75+ 57 13 0.2280702 7 65+ 273 68 0.2490842 8 all 1175 200 0.1702128

This may not be possible with ... , and instead, you may need to pass a vector / list of column names for tags.

Any ideas?

EDIT: to be clear, the decision should accept the tag / group definitions and the required summary statistics as arguments, and not be built into the function itself. Either as a two-stage data %>% group_by_tags(tags) %>% summarise_tags(stats) , or as a one-step process data %>% summary_tags(tags,stats) .

+5

r group-by dplyr

wjchulme Aug 23 '16 at 16:56

source share

5 answers

This is a @eddi answer option. I take the definitions of highage et al as part of the function definition:

 library(data.table) custom_summary = function(DT, tags, stats){ setDT(DT) rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE])) DT[rows$values, eval(stats), by=.(tag = rows$ind)] }

And an example of use:

 data(esoph) library(dplyr) esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1), esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0) ) custom_summary( DT = esophlong, tags = quote(list( 'age>=65' = agegp %in% c('65-74','75+'), 'alc>=80' = alcgp %in% c('80-119','120+'), 'tob>=20' = tobgp %in% c('20-29','30+'), 'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))), 'all ages' = TRUE )), stats = quote(list( n = .N, n_cases = sum(case), case.rate = mean(case) )) ) tag n n_cases case.rate 1: age>=65 273 68 0.2490842 2: alc>=80 301 96 0.3189369 3: tob>=20 278 64 0.2302158 4: high risk 11 5 0.4545455 5: all ages 1175 200 0.1702128

The technique for using eval inside DT[...] is explained in the data section .

+3

Frank Aug 23 '16 at 21:26

source share

Not a fully functional answer , more "WIP" or start a conversation. This should ultimately go to the repo and either an additional package or PR for dplyr .

One way is to mimic the structure of attributes from a “normally” grouped variable:

 library(dplyr) esoph %>% group_by(agegp, alcgp) %>% attributes %>% str # List of 9 # $ names : chr [1:5] "agegp" "alcgp" "tobgp" "ncases" ... # $ row.names : int [1:88] 1 2 3 4 5 6 7 8 9 10 ... # $ class : chr [1:4] "grouped_df" "tbl_df" "tbl" "data.frame" # $ vars :List of 2 # ..$ : symbol agegp # ..$ : symbol alcgp # $ drop : logi TRUE # $ indices :List of 24 # ..$ : int [1:4] 0 1 2 3 # ..$ : int [1:4] 4 5 6 7 # ..$ : int [1:3] 8 9 10 # ........... # $ group_sizes : int [1:24] 4 4 3 4 4 4 4 3 4 4 ... # $ biggest_group_size: int 4 # $ labels :'data.frame': 24 obs. of 2 variables: # ..$ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 2 2 2 2 3 3 ... # ..$ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 2 3 4 1 2 3 4 1 2 ... # ..- attr(*, "vars")=List of 2 # .. ..$ : symbol agegp # .. ..$ : symbol alcgp # ..- attr(*, "drop")= logi TRUE

We can reproduce this artificially to see if / how this works:

 esoph2 <- esoph syms <- list(as.symbol("agegp65"), as.symbol("alcgp80")) attr(esoph2, "vars") <- syms attr(esoph2, "drop") <- TRUE # 'agegp' and 'aclgp' are ordered factors, for simplicity here just using ints # `group_by` indices are 0-based indices <- list( which(as.integer(esoph2$agegp) >= 5) - 1, which(as.integer(esoph2$alcgp) >= 3) - 1 ) attr(esoph2, "indices") <- indices attr(esoph2, "group_sizes") <- lengths(indices) attr(esoph2, "biggest_group_size") <- max(lengths(indices)) df <- data.frame(agegp65 = "agegp >= 65", alcgp80 = "alcgp >= 80", stringsAsFactors = FALSE) attr(df, "vars") <- syms attr(esoph2, "labels") <- df class(esoph2) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

Which "looks" like a regular grouped data.frame:

 str(esoph2) # Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 88 obs. of 5 variables: # $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ... # $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ... # $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ... # $ ncases : num 0 0 0 0 0 0 0 0 0 0 ... # $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ... # - attr(*, "vars")=List of 2 # ..$ : symbol agegp65 # ..$ : symbol alcgp80 # - attr(*, "drop")= logi TRUE # - attr(*, "indices")=List of 2 # ..$ : num 62 63 64 65 66 67 68 69 70 71 ... # ..$ : num 8 9 10 11 12 13 14 23 24 25 ... # - attr(*, "group_sizes")= int 26 42 # - attr(*, "biggest_group_size")= int 42 # - attr(*, "labels")='data.frame': 1 obs. of 2 variables: # ..$ agegp65: chr "agegp >= 65" # ..$ alcgp80: chr "alcgp >= 80" # ..- attr(*, "vars")=List of 2 # .. ..$ : symbol agegp65 # .. ..$ : symbol alcgp80 esoph2 # Source: local data frame [88 x 5] # Groups: agegp65, alcgp80 [2] # agegp alcgp tobgp ncases ncontrols # <ord> <ord> <ord> <dbl> <dbl> # 1 25-34 0-39g/day 0-9g/day 0 40 # 2 25-34 0-39g/day 10-19 0 10 # 3 25-34 0-39g/day 20-29 0 6 # 4 25-34 0-39g/day 30+ 0 5 # 5 25-34 40-79 0-9g/day 0 27 # 6 25-34 40-79 10-19 0 7 # 7 25-34 40-79 20-29 0 4 # 8 25-34 40-79 30+ 0 7 # 9 25-34 80-119 0-9g/day 0 2 # 10 25-34 80-119 10-19 0 1 # # ... with 78 more rows

Unfortunately:

 esoph2 %>% summarize(n = n()) # Error: corrupt 'grouped_df', contains 88 rows, and 68 rows in groups

Ergo my comment is that summarize assumes full coverage; you will have to change dplyr_summarise_impl (in C ++), perhaps make the third option summarise_grouped and summarise_not_grouped .

+1

r2evans Aug 23 '16 at 17:38

source share

 library(data.table) setDT(esophlong) special.summary = function(dt, vars) { rbindlist(lapply(seq_along(vars), function(i) { var = vars[[i]] if (is.logical(dt[, eval(var)])) { dt[eval(var) == TRUE, .(.N, sum(case), mean(case))][, tag := names(vars)[i]][ , .SD, by = tag] # last step is a lazy version of setcolorder } else { dt[, .(.N, sum(case), mean(case)), by = .(tag = eval(var))] } })) } special.summary(esophlong, list('age>=65'=quote(highage), 'alc>=80'=quote(highalc), 'tob>=20'=quote(hightob), 'high risk'=quote(highrisk), 'all'=quote(TRUE))) # tag N V2 V3 #1: age>=65 273 68 0.2490842 #2: alc>=80 301 96 0.3189369 #3: tob>=20 278 64 0.2302158 #4: high risk 11 5 0.4545455 #5: all 1175 200 0.1702128 special.summary(esophlong, list(quote(agegp), '65+'=quote(agegp %in% c('65-74','75+')), 'all'=quote(TRUE))) # tag N V2 V3 #1: 25-34 117 1 0.008547009 #2: 35-44 208 9 0.043269231 #3: 45-54 259 46 0.177606178 #4: 55-64 318 76 0.238993711 #5: 65-74 216 55 0.254629630 #6: 75+ 57 13 0.228070175 #7: 65+ 273 68 0.249084249 #8: all 1175 200 0.170212766

This can, of course, be made more customizable, and this remained as an exercise for the reader.

+1

eddi Aug 23 '16 at 20:53

source share

Here's the (mostly) dplyr version:

If columns are created by OP, tags can be:

 tags = list('age>=65'="highage", 'alc>=80'="highalc", 'tob>=20'="hightob", 'high risk'="highrisk", 'all'=TRUE)

But it’s better to create filter expressions from the source, like @Frank:

 tags1 = list( 'age>=65' = ~agegp %in% c('65-74','75+'), 'alc>=80' = ~alcgp %in% c('80-119','120+'), 'tob>=20' = ~tobgp %in% c('20-29','30+'), 'high risk' = ~agegp %in% c('65-74','75+') & alcgp %in% c('80-119','120+') & tobgp %in% c('20-29','30+'), 'all ages' = TRUE )

Then create a function that uses lapply to run the dplyr summary for each line of tags1 :

 my_summary = function(dat, groups) { bind_rows(lapply(1:length(groups), function(i) { dat %>% filter_(groups[[i]]) %>% summarise(tag=names(groups)[i], n=n(), ncases=sum(case), case.rate=mean(case)) })) } my_summary(esophlong, tags1)

  tag n ncases case.rate 1 age>=65 273 68 0.2490842 2 alc>=80 301 96 0.3189369 3 tob>=20 278 64 0.2302158 4 high risk 11 5 0.4545455 5 all 1175 200 0.1702128

I was hoping to create an easier way to generate filter expressions, but I'm still a little puzzled by how to create complex expressions for use in standard versions of dplyr functions.

For example, I will be interested in how to use an approach similar to the one below. The filt function filt intended to create a filter expression, but the return expression must be filter_ must have ~ before filter_ for proper interpretation. Or maybe there is some kind of interp binding that is required. In any case, I would be interested in how to make this work (or suggestions better), as well as how to create a filter with several conditions (as in the "high risk" filter) by combining individual filters:

 # Create a filtering expression filt = function(var, cutoff) { paste("as.numeric(gsub('([0-9]{1,3})[-+].*','\\1',", var, ")) >= ", cutoff) } # Run the summary function with three different filters plus "all" my_summary(esophlong, c(mapply(filt, c("agegp","alcgp","tobgp"), c(65,80,20)), 'all'=TRUE))

0

eipi10 Aug 23 '16 at 10:33

source share

wjchulme · Accepted Answer · 2018-02-16T18:42:53+0000

In the absence of any knowledge about the tidyverse internals, I avoided trying to create a group_by() -type function whose output should be passed to summarise() and instead make one function that combines both (similar to other answers, but I hope more convenient and universal).

Since group_by() %>% summarise() returns general summary information for each nested combination of grouping variables, I chose the name summarise_marginal() , since it will return the ultimate summary information for each grouping variable independently.

Solution that does not work with `grouped_df` objects

Firstly, a solution that does not work with grouped_df classes but extends below:

 summarise_marginal0 <- function(.tbl, .vars, ..., .removeF=FALSE){ dots <- quos(...) .tbl %>% transmute(!!! .vars) %>% map_dfr( ~ summarise(group_by(.tbl, 'value'=., add = TRUE), !!! dots) %>% # piping .tbl %>% group_by() %>% summarise() evaluates in the wrong order for some reason filter_at(vars('value'), all_vars(!(.==FALSE & .removeF))) %>% # to remove rows where a logical group is FALSE. mutate_at(vars('value'), as.character) # standardises 'value' column in case map_dfr tries to convert logical to factor , .id='group' ) } mtcars %>% summarise_marginal0( vars(cyl, am), meanmpg = mean(mpg), meanwt = mean(wt) ) #> # A tibble: 5 x 4 #> group value meanmpg meanwt #> <chr> <chr> <dbl> <dbl> #> 1 cyl 4 26.66364 2.285727 #> 2 cyl 6 19.74286 3.117143 #> 3 cyl 8 15.10000 3.999214 #> 4 am 0 17.14737 3.768895 #> 5 am 1 24.39231 2.411000

Capturing groups using vars() (for example, using summarise_at() or mutate_at() ) neatly separates groups from the resulting functions and allows you to create new groups on the fly:

 mtcars %>% summarise_marginal0( vars(cyl, hp_lt100 = hp<100), meanmpg = mean(mpg), meanwt = mean(wt) ) #> # A tibble: 5 x 4 #> group value meanmpg meanwt #> <chr> <chr> <dbl> <dbl> #> 1 cyl 4 26.66364 2.285727 #> 2 cyl 6 19.74286 3.117143 #> 3 cyl 8 15.10000 3.999214 #> 4 hp_lt100 FALSE 17.45217 3.569652 #> 5 hp_lt100 TRUE 26.83333 2.316667

We can use the .removeF argument to remove boolean values of FALSE . Useful if you want to summarize certain lines, but not their compliment:

 mtcars %>% summarise_marginal0( vars(cyl==6, hp_lt100 = hp<100, hp_lt200 = hp<200), meanmpg = mean(mpg), meanwt = mean(wt), .removeF = TRUE ) #> # A tibble: 3 x 4 #> group value meanmpg meanwt #> <chr> <chr> <dbl> <dbl> #> 1 cyl == 6 TRUE 19.74286 3.117143 #> 2 hp_lt100 TRUE 26.83333 2.316667 #> 3 hp_lt200 TRUE 21.96000 2.911320

Please note that even without explicitly identifying the group cyl == 6 we still get a useful name for it.

Solution that works with `grouped_df` objects

summarise_marginal0() can be extended to work with grouped_df objects returned by group_by() :

 summarise_marginal <- function(.tbl, .vars, ...){ dots <- quos(...) .tbl %>% nest() %>% mutate( summarised = map(data, ~summarise_marginal0(., .vars, !!! dots)) ) %>% unnest(summarised) %>% purrrlyr::slice_rows(group_vars(.tbl)) } mtcars %>% group_by(am) %>% summarise_marginal( vars(cyl, hp_lt100 = hp<100), meanmpg = mean(mpg), meanwt = mean(wt) ) #> # A tibble: 10 x 5 #> # Groups: am [2] #> am group value meanmpg meanwt #> <dbl> <chr> <chr> <dbl> <dbl> #> 1 1 cyl 4 28.07500 2.042250 #> 2 1 cyl 6 20.56667 2.755000 #> 3 1 cyl 8 15.40000 3.370000 #> 4 1 hp_lt100 FALSE 20.61429 2.756857 #> 5 1 hp_lt100 TRUE 28.80000 2.007500 #> 6 0 cyl 4 22.90000 2.935000 #> 7 0 cyl 6 19.12500 3.388750 #> 8 0 cyl 8 15.05000 4.104083 #> 9 0 hp_lt100 FALSE 16.06875 3.925250 #> 10 0 hp_lt100 TRUE 22.90000 2.935000

In fact, summarise_marginal() will work for both grouped and non- data.frame , so this function alone is suitable.

This is a useful solution, but provided that group_by() uses functions outside of summarise() , for example using nest() or do() , I think the idea of group_by_marginal() (or group_by_tag() or something else is better ) worth pursuing.

Some remaining issues:

The function must convert integer, factorial, and logical columns to characters so that their values fit well in the same values column. This is a small violation of the principles of accurate data, but does not differ from how gather() behaves.
Assuming that the function group_by_marginal() is possible, its output cannot be passed to mutate() without resolving the ambiguity of placing values from each group. From the above example, what value of meanmpg should be indicated in the line with cyl==4 and am==0 ? Both 26.66364 (from cyl==4 ) and 17.14737 (from am==0 ) are relevant. (Note that for group_by() %>% mutate() there is no ambiguity, as it will return a joint summary function for cyl==4 & am==0 ). Three possible options for group_by_marginal() %>% mutate() :
- This should be prohibited.
- It should create some columns, for example meanmpg_cyl and meanmpg_am .
- It should replicate rows for each group.
Speed. I am sure that my implementation of this concept is inefficient and can be improved.

Finally, to demonstrate the original task of the example:

 bind_rows( esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1), esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0) ) %>% summarise_marginal( vars(highage = agegp %in% c('65-74','75+'), highalc = alcgp %in% c('80-119','120+'), hightob = tobgp %in% c('20-29','30+'), highrisk = highage & highalc & hightob, all = 1), n=length(agegp), ncases=sum(case), case.rate=mean(case), .removeF=TRUE ) #> # A tibble: 5 x 5 #> group value n ncases case.rate #> <chr> <chr> <int> <dbl> <dbl> #> 1 highage TRUE 273 68 0.2490842 #> 2 highalc TRUE 301 96 0.3189369 #> 3 hightob TRUE 278 64 0.2302158 #> 4 highrisk TRUE 11 5 0.4545455 #> 5 all 1 1175 200 0.1702128

A function similar to group_by when groups are not mutually exclusive

Solution that does not work with grouped_df objects

Solution that works with grouped_df objects

More articles:

Solution that does not work with `grouped_df` objects

Solution that works with `grouped_df` objects