I would like to create a function in R similar to the dplyr group_by function, which in combination with summarise can provide summary statistics for a dataset where group membership is not mutually exclusive. I., observations can belong to several groups. One way to think about it is to consider tags; observations can belong to one or more tags that may overlap.
For example, take the R esoph dataset ( https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html ), documenting a case study of esophageal cancer. Suppose I’m interested in the number and proportion of cancer cases in general and by the “tag”, where the tags are: over 65; 80+ g / day of alcohol; 20+ g / day of tobacco; and a high-risk group in which the previous 3 criteria are met. Convert the data set to a long format (one member per line), and then add these tags (logical columns) to the data set:
library('dplyr') data(esoph) esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1), esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0) ) %>% mutate(highage=(agegp %in% c('65-74','75+')), highalc=(alcgp %in% c('80-119','120+')), hightob=(tobgp %in% c('20-29','30+')), highrisk=(highage & highalc & hightob) )
My usual approach is to create a dataset where each observation is duplicated for each of its tags, and then summarise this dataset:
esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'), esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'), esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'), esophlong %>% filter(highrisk) %>% mutate(tag='high risk'), esophlong %>% filter() %>% mutate(tag='all') ) %>% mutate(tag=factor(tag, levels = unique(.$tag))) summary = esophdup %>% group_by(tag) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case))
This approach is inefficient for large datasets or for a large number of tags, and I often will have insufficient memory to store it.
An alternative is to summarise each tag separately, and then bind these pivot data arrays as follows:
summary.age = esophlong %>% filter(highage) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='age>=65') summary.alc = esophlong %>% filter(highalc) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='alc>=80') summary.tob = esophlong %>% filter(hightob) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='tob>=20') summary.highrisk = esophlong %>% filter(highrisk) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='high risk') summary.all = esophlong %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>% mutate(tag='all') summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)
This approach is time consuming and tedious when I have a large number of tags, or I want to reuse tags often for different total metrics throughout the project.
The function that I mean, say group_by_tags(data, key, ...) , which includes an argument to indicate the name of the grouping column, should work something like this:
summary = esophlong %>% group_by_tags(key='tags', 'age>=65'=highage, 'alc>=80'=highalc, 'tob>=20'=hightob, 'high risk'=highrisk, 'all ages'=1 ) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case))
where the composite dataset is as follows:
> summary tags n ncases case.rate 1 age>=65 273 68 0.2490842 2 alc>=80 301 96 0.3189369 3 tob>=20 278 64 0.2302158 4 high risk 11 5 0.4545455 5 all 1175 200 0.1702128
Moreover, it can take variables of the type “factor”, as well as “logical”, so that it can summarize, say, each age group individually, over 65 years old and all:
summaryage = esophlong %>% group_by_tags(key='Age.group', agegp, '65+'=(agegp %in% c('65-74','75+')), 'all'=1 ) %>% summarise(n=n(), ncases=sum(case), case.rate=mean(case)) >summaryage Age.group n ncases case.rate 1 25-34 117 1 0.0085470 2 35-44 208 9 0.0432692 3 45-54 259 46 0.1776062 4 55-64 318 76 0.2389937 5 65-74 216 55 0.2546296 6 75+ 57 13 0.2280702 7 65+ 273 68 0.2490842 8 all 1175 200 0.1702128
This may not be possible with ... , and instead, you may need to pass a vector / list of column names for tags.
Any ideas?
EDIT: to be clear, the decision should accept the tag / group definitions and the required summary statistics as arguments, and not be built into the function itself. Either as a two-stage data %>% group_by_tags(tags) %>% summarise_tags(stats) , or as a one-step process data %>% summary_tags(tags,stats) .