Using ddply to apply a function to groups of rows

I use ddply quite a bit, but I do not consider myself an expert. I have a data frame (df) with a grouping variable “Group” that takes the values “A”, “B” and “C”, and a numeric variable to sum, “Var”. If I use

ddply(df, .(Group), summarize, mysum=sum(Var)) 

then I get the sum for each of A, B and C, which is correct. But what I want is to summarize each consecutive run of the grouping variable, in the order the runs appear in the data frame. For example, if the data frame has

    Group Var
    A     1.3
    A     1.2
    A     0.4
    B     0.3
    B     1.3
    C     1.5
    C     1.7
    C     1.9
    A     2.1
    A     2.4
    B     6.7

Desired Result

    A 2.9
    B 1.6
    C 5.1
    A 4.5
    B 6.7

Thus, the desired result applies the function to each contiguous block of rows that share a Group value, not to all rows with that Group value across the whole data frame. Can this be done in ddply?

Data

    dat <- structure(list(
      Group = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"),
      Var = c(1.3, 1.2, 0.4, 0.3, 1.3, 1.5, 1.7, 1.9, 2.1, 2.4, 6.7)),
      .Names = c("Group", "Var"), class = "data.frame", row.names = c(NA, -11L))
2 answers

Here is one way to do this using the recently implemented rleid() function from data.table v1.9.6. See #686.

This generates grouping identifiers as needed:

    require(data.table) ## v1.9.6+
    DT = as.data.table(dat)
    rleid(DT$Group)
    # [1] 1 1 1 2 2 3 3 3 4 4 5

We can use this directly for aggregation as follows:

    DT[, .(sum=sum(Var)), by=.(Group, rleid(Group))]
    #    Group rleid sum
    # 1:     A     1 2.9
    # 2:     B     2 1.6
    # 3:     C     3 5.1
    # 4:     A     4 4.5
    # 5:     B     5 6.7

HTH


Here is a base R equivalent:

    dat <- structure(list(
      Group = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"),
      Var = c(1.3, 1.2, 0.4, 0.3, 1.3, 1.5, 1.7, 1.9, 2.1, 2.4, 6.7)),
      .Names = c("Group", "Var"), class = "data.frame", row.names = c(NA, -11L))

    with(dat, cumsum(c(1L, Group[-length(Group)] != Group[-1])))
    # [1] 1 1 1 2 2 3 3 3 4 4 5

As a function

    rleid <- function(x) cumsum(c(1L, x[-length(x)] != x[-1]))
    (dat <- within(dat, id <- rleid(Group)))
    #    Group Var id
    # 1      A 1.3  1
    # 2      A 1.2  1
    # 3      A 0.4  1
    # 4      B 0.3  2
    # 5      B 1.3  2
    # 6      C 1.5  3
    # 7      C 1.7  3
    # 8      C 1.9  3
    # 9      A 2.1  4
    # 10     A 2.4  4
    # 11     B 6.7  5

Then aggregate based on the new variable:

    aggregate(Var ~ ., dat, sum)
    #   Group id Var
    # 1     A  1 2.9
    # 2     B  2 1.6
    # 3     C  3 5.1
    # 4     A  4 4.5
    # 5     B  5 6.7
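
Since the question asks about ddply specifically, the same run identifier can also be used as a grouping variable in plyr. This is only a minimal sketch, assuming the plyr package is installed; `run_id` is a hypothetical helper name that repeats the cumsum-based rleid function above so that the snippet runs on its own.

```r
# Assumption: plyr is installed. run_id is a hypothetical name for the
# same cumsum-based run identifier shown earlier in this answer.
library(plyr)

dat <- data.frame(
  Group = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"),
  Var   = c(1.3, 1.2, 0.4, 0.3, 1.3, 1.5, 1.7, 1.9, 2.1, 2.4, 6.7),
  stringsAsFactors = FALSE
)

# A new id starts whenever the value differs from the previous one
run_id <- function(x) cumsum(c(1L, x[-length(x)] != x[-1]))
dat$id <- run_id(dat$Group)

# Group by the run id first so the original run order is preserved
res <- ddply(dat, .(id, Group), summarize, mysum = sum(Var))
res
#   id Group mysum
# 1  1     A   2.9
# 2  2     B   1.6
# 3  3     C   5.1
# 4  4     A   4.5
# 5  5     B   6.7
```

Putting `id` before `Group` in `.()` matters here: ddply sorts its output by the grouping variables, so leading with the run id keeps the rows in the order the runs appear in the data.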

Alternatively, you can actually use rle, but it requires an atomic vector, so if you are working with a factor you need an extra step (i.e. as.vector):

    rleid2 <- function(x) {
      x <- as.vector(x)
      rep(seq_along(rle(x)$values), rle(x)$lengths)
    }
    rleid2(dat$Group)
    # [1] 1 1 1 2 2 3 3 3 4 4 5
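
To illustrate why the as.vector step is there, the following sketch shows rle rejecting a factor and accepting the converted vector. The toy factor `g` and the variable names are made up for this example and are not part of the original data.

```r
# Hypothetical toy factor, used only for illustration
g <- factor(c("A", "A", "B", "C", "C", "A"))

# rle() stops with an error on a factor, because a factor carries
# attributes and is therefore not a plain atomic vector
err <- tryCatch(rle(g), error = function(e) conditionMessage(e))
is.character(err)
# [1] TRUE

# After conversion to a character vector, rle() works as expected
r <- rle(as.vector(g))
ids <- rep(seq_along(r$lengths), r$lengths)
ids
# [1] 1 1 2 3 3 4
```

Calling rle once and reusing the result, as done here, also avoids computing the run-length encoding twice.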

Some benchmarks:

    set.seed(1)
    dat2 <- dat[sample(1:nrow(dat), 1e6, TRUE), ]

    identical(data.table::rleid(dat2$Group), rleid(dat2$Group))
    # [1] TRUE

    library('microbenchmark')
    microbenchmark(data.table::rleid(dat2$Group),
                   rleid(dat2$Group),
                   rleid2(dat2$Group),
                   unit = 'relative')
    # Unit: relative
    #                           expr       min        lq      mean    median        uq       max neval cld
    #  data.table::rleid(dat2$Group)  1.032777  1.015395  1.005023  1.020923  1.000612 0.8935531   100  a
    #              rleid(dat2$Group)  1.000000  1.000000  1.000000  1.000000  1.000000 1.0000000   100  a
    #             rleid2(dat2$Group) 35.747987 35.351585 28.600030 34.058992 33.147546 9.8786083   100   b