R: Stratified fraction of random samples of a unique identifier by a grouping variable

Question

With the following sample data, I would like to draw a stratified random sample (for example, 40%) of the identifier “ID” from each level of the “Cohort” factor:

 data <- structure(
   list(
     Cohort = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
     ID = structure(1:20,
                    .Label = c("a1 ", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9",
                               "b10", "b11", "b12", "b13", "b14", "b15", "b16", "b17",
                               "b18", "b19", "b20"),
                    class = "factor")
   ),
   .Names = c("Cohort", "ID"),
   class = "data.frame",
   row.names = c(NA, -20L)
 )

I know how to draw a fixed number of random rows from each group using the following:

 library(dplyr)

 data %>%
   group_by(Cohort) %>%
   sample_n(size = 10)

But my actual data is longitudinal: each identifier appears several times within its cohort, and the cohorts differ in size. So I need to sample a fraction of the unique identifiers in each cohort rather than a fraction of the rows. Any help would be appreciated.

+6

random r sampling dplyr

user3594490 Nov 21 '15 at 0:18

2 answers

Why not:

 library(dplyr)

 data %>%
   select(ID, Cohort) %>%
   distinct() %>%
   group_by(Cohort) %>%
   sample_frac(0.4) %>%
   left_join(data)
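In dplyr 1.0.0 and later, `sample_frac()` is superseded by `slice_sample()`. The following is a minimal sketch of the same approach with the newer function, not a replacement for the answer above. It assumes a stand-in data frame (`long_data`, a hypothetical name) shaped like the question's data, with each identifier repeated twice. As far as I know, `slice_sample(prop = ...)` rounds the group size down, so the number of identifiers drawn per cohort can differ slightly from `sample_frac()`.

```r
library(dplyr)

set.seed(42)  # only to make the draw reproducible

# Stand-in for the question's data (an assumption for illustration):
# 9 IDs in cohort 1, 11 in cohort 2, each ID appearing twice.
ids <- data.frame(
  Cohort = rep(1:2, times = c(9, 11)),
  ID = c(paste0("a", 1:9), paste0("b", 10:20))
)
long_data <- rbind(ids, ids)
long_data$value <- rnorm(nrow(long_data))

sampled <- long_data %>%
  distinct(Cohort, ID) %>%        # one row per identifier
  group_by(Cohort) %>%
  slice_sample(prop = 0.4) %>%    # sample identifiers within each cohort
  ungroup() %>%
  inner_join(long_data, by = c("Cohort", "ID"))  # all rows for the sampled IDs
```

Naming the join columns in `by` avoids the message dplyr prints when it has to guess them, and `inner_join()` gives the same rows as `left_join()` here because every sampled identifier exists in `long_data`.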

+5

bramtayl Nov 21 '15 at 0:39

eipi10 · Accepted Answer · 2015-11-21T00:34:27+0000

Here is one way:

 data %>%
   group_by(Cohort) %>%
   filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

This returns every row belonging to the randomly selected identifiers. In other words, I assume that you have measurements that go with each row, and that you want all measurements for each sampled identifier. (If you just need one row returned for each sampled ID, then @bramtayl's answer will do this.)
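To make the steps explicit, the same idea can be spelled out in base R. This is only a sketch under assumptions: `long_data` and `pick_ids` are hypothetical names, the data frame is a stand-in shaped like the question's data with each identifier repeated twice, and identifiers are assumed not to be shared between cohorts.

```r
set.seed(1)  # only to make the draw reproducible

# Stand-in data (an assumption): 9 IDs in cohort 1, 11 in cohort 2, each twice
ids <- data.frame(
  Cohort = rep(1:2, times = c(9, 11)),
  ID = c(paste0("a", 1:9), paste0("b", 10:20)),
  stringsAsFactors = FALSE
)
long_data <- rbind(ids, ids)
long_data$value <- rnorm(nrow(long_data))

# Unique identifiers within each cohort
ids_by_cohort <- lapply(split(long_data$ID, long_data$Cohort), unique)

# Draw ceiling(frac * n) identifiers from a vector of n unique identifiers.
# Indexing via sample.int() avoids sample()'s special handling of a
# length-one numeric argument.
pick_ids <- function(x, frac = 0.4) {
  x[sample.int(length(x), ceiling(frac * length(x)))]
}
picked <- unlist(lapply(ids_by_cohort, pick_ids))

# Keep every row whose identifier was drawn
# (safe here because IDs are not shared between cohorts)
result <- long_data[long_data$ID %in% picked, ]
```

With 9 and 11 identifiers, `ceiling()` rounds 3.6 up to 4 and 4.4 up to 5, so `result` holds 18 rows: two for each of the 9 sampled identifiers.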

For instance:

 data = data.frame(rbind(data, data), value=rnorm(2*nrow(data)))

 data %>%
   group_by(Cohort) %>%
   filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

    Cohort     ID       value
     (int) (fctr)       (dbl)
 1       1     a1 -0.92370760
 2       1     a2 -0.37230655
 3       1     a3 -1.27037502
 4       1     a7 -0.34545295
 5       2    b14 -2.08205561
 6       2    b17  0.31393998
 7       2    b18 -0.02250819
 8       2    b19  0.53065857
 9       2    b20  0.03924414
 10      1     a1 -0.08275011
 11      1     a2 -0.10036822
 12      1     a3  1.42397042
 13      1     a7 -0.35203237
 14      2    b14  0.30422865
 15      2    b17 -1.82008014
 16      2    b18  1.67548568
 17      2    b19  0.74324596
 18      2    b20  0.27725794
