R: Stratified random sample of a fraction of unique identifiers by a grouping variable

With the following sample data, I would like to draw a stratified random sample (for example, 40%) of the identifier "ID" from each level of the "Cohort" factor:

 data <- structure(list(
   Cohort = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
              2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
   ID = structure(1:20, .Label = c("a1 ", "a2", "a3", "a4", "a5", "a6", "a7",
                                   "a8", "a9", "b10", "b11", "b12", "b13", "b14",
                                   "b15", "b16", "b17", "b18", "b19", "b20"),
                  class = "factor")),
   .Names = c("Cohort", "ID"), class = "data.frame", row.names = c(NA, -20L))

I know how to draw a fixed number of random rows per group using the following:

 library(dplyr)
 data %>% group_by(Cohort) %>% sample_n(size = 10)

But my actual data is longitudinal: each identifier appears in several rows within its cohort, and the cohorts differ in size. So I need to sample a fraction of the unique identifiers in each cohort, not a fraction of the rows. Any help would be appreciated.

2 answers

Here is one way:

 data %>% group_by(Cohort) %>% filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID))))) 

This returns all rows that belong to the randomly selected identifiers. In other words, I assume that you have measurements that go with each row, and that you want all measurements for each sampled identifier. (If you just need one row returned for each sampled ID, then @bramtayl's answer will do this.)

For instance:

 data = data.frame(rbind(data, data), value=rnorm(2*nrow(data)))

 data %>%
   group_by(Cohort) %>%
   filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

    Cohort     ID       value
     (int) (fctr)       (dbl)
 1       1     a1 -0.92370760
 2       1     a2 -0.37230655
 3       1     a3 -1.27037502
 4       1     a7 -0.34545295
 5       2    b14 -2.08205561
 6       2    b17  0.31393998
 7       2    b18 -0.02250819
 8       2    b19  0.53065857
 9       2    b20  0.03924414
 10      1     a1 -0.08275011
 11      1     a2 -0.10036822
 12      1     a3  1.42397042
 13      1     a7 -0.35203237
 14      2    b14  0.30422865
 15      2    b17 -1.82008014
 16      2    b18  1.67548568
 17      2    b19  0.74324596
 18      2    b20  0.27725794
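
To check that the filter really keeps a fraction of unique IDs rather than a fraction of rows, here is a minimal sketch. The seed, the toy data frame `long`, and the summary `kept` are made up for illustration and are not part of the question's data; the pipeline itself is the one shown above.

```r
library(dplyr)

set.seed(42)  # arbitrary seed (an assumption), only to make the draw repeatable

# Hypothetical longitudinal data: 9 IDs in cohort 1, 11 in cohort 2,
# each ID measured twice
long <- data.frame(
  Cohort = rep(c(rep(1L, 9), rep(2L, 11)), times = 2),
  ID     = rep(c(paste0("a", 1:9), paste0("b", 10:20)), times = 2),
  value  = rnorm(40)
)

# Sample 40% of the unique IDs in each cohort and keep all of their rows
sampled <- long %>%
  group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4 * length(unique(ID))))) %>%
  ungroup()

# Count the distinct IDs that survived in each cohort
kept <- sampled %>%
  group_by(Cohort) %>%
  summarise(n_ids = n_distinct(ID))
```

Because of `ceiling()`, cohort 1 keeps 4 of its 9 IDs (0.4 * 9 = 3.6) and cohort 2 keeps 5 of its 11 (0.4 * 11 = 4.4). With two rows per ID that gives 18 rows in total, the same shape as the output above.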

Why not

 library(dplyr)

 data %>%
   select(ID, Cohort) %>%
   distinct %>%
   group_by(Cohort) %>%
   sample_frac(0.4) %>%
   left_join(data)