Using Tidyr / Dplyr to Summarize String Group Counts

I need to count the rows that fall into groups I define, and I know that this can be done with dplyr / tidyr, but I am missing something.

Dataset example:

    Owner = c('bob','julia','cheryl','bob','julia','cheryl')
    Day = c('Mon', 'Tue')
    Locn = c('house','store','apartment','office','house','shop')
    data <- data.frame(Owner, Day, Locn)

which is as follows:

       Owner Day      Locn
    1    bob Mon     house
    2  julia Tue     store
    3 cheryl Mon apartment
    4    bob Tue    office
    5  julia Mon     house
    6 cheryl Tue      shop

I want to group by owner and day, and then count the grouped locations in columns. In this example, I want “house” and “apartment” to be counted in the “Home” column, and “store”, “office” and “shop” to be counted in the “Work” column.

My current code (which does not work):

    grouped_locn <- data %>%
      dplyr::arrange(Owner, Day) %>%
      dplyr::group_by(Owner, Day) %>%
      dplyr::summarize(Home = which(data$Locn %in% c('house', 'apartment')),
                       Work = which(data$Locn %in% c("store", "office", "apartment")))

I included only my current attempt at the summarizing step to show how I was approaching it. The Home and Work expressions currently return vectors of the row numbers that belong to each group (for example, Home = 1 3 5) rather than counts.
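To illustrate the difference between row numbers and counts, here is a minimal base R sketch using the same Locn values as above. The name `is_home` is only an illustrative helper, not part of the original code. Note also that writing `data$Locn` inside `summarize()` refers to the whole column of the original data frame, not to the rows of the current group.

```r
# Same Locn values as in the example data
Locn <- c("house", "store", "apartment", "office", "house", "shop")

# Logical vector: TRUE where the location belongs to the "Home" group
is_home <- Locn %in% c("house", "apartment")

which(is_home)  # positions of the TRUE values: 1 3 5
sum(is_home)    # number of TRUE values: 3
```

`which()` answers "where are the matches", while `sum()` on a logical vector answers "how many matches are there", which is what a count column needs.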

My intended output:

       Owner Day Home Work
    1    bob Mon    1    0
    2    bob Tue    0    1
    3  julia Mon    1    0
    4  julia Tue    0    1
    5 cheryl Mon    1    0
    6 cheryl Tue    0    1

In the actual dataset (30k+ rows) there are several Locn values for each owner per day, so the Home and Work counts can be numbers other than 1 and 0 (which is why logical values will not do).

Thank you very much.

4 answers

Try this:

    data %>%
      group_by(Owner, Day) %>%
      summarise(Home = sum(Locn %in% c("house", "apartment")),
                Work = sum(Locn %in% c("store", "office", "shop")))
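As a quick check that this approach also handles several locations per owner and day, here is a small sketch on made-up data. The `visits` data frame and its values are hypothetical, and the `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data: bob visits three places on Monday
visits <- data.frame(
  Owner = c("bob", "bob", "bob", "julia"),
  Day   = c("Mon", "Mon", "Mon", "Tue"),
  Locn  = c("house", "office", "shop", "store")
)

counts <- visits %>%
  group_by(Owner, Day) %>%
  summarise(
    Home = sum(Locn %in% c("house", "apartment")),   # rows in the Home group
    Work = sum(Locn %in% c("store", "office", "shop")), # rows in the Work group
    .groups = "drop"                                  # return an ungrouped result
  )

counts
# bob / Mon has Home = 1 and Work = 2; julia / Tue has Home = 0 and Work = 1
```

Because `Locn` is referenced without `data$`, each `sum()` only sees the rows of the current Owner / Day group.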

Here's a simple and efficient solution using data.table

For older versions (v < 1.9.5)

    library(data.table) # v < 1.9.5
    setDT(data)[, Locn2 := c("Work", "Home")[(Locn %in% c('house', 'apartment')) + 1L]]
    dcast.data.table(data, Owner + Day ~ Locn2, length)
    #     Owner Day Home Work
    # 1:    bob Mon    1    0
    # 2:    bob Tue    0    1
    # 3: cheryl Mon    1    0
    # 4: cheryl Tue    0    1
    # 5:  julia Mon    1    0
    # 6:  julia Tue    0    1

For newer versions (v >= 1.9.5) you can do this in one line

 dcast(setDT(data), Owner + Day ~ c("Work", "Home")[(Locn %in% c('house', 'apartment')) + 1L], length) 

Here is a tidyr alternative

    library(dplyr)
    library(tidyr)
    data %>%
      mutate(temp = 1L,
             Locn = ifelse(Locn %in% c('house', 'apartment'), "Home", "Work")) %>%
      spread(Locn, temp, fill = 0L)
    #    Owner Day Home Work
    # 1    bob Mon    1    0
    # 2    bob Tue    0    1
    # 3 cheryl Mon    1    0
    # 4 cheryl Tue    0    1
    # 5  julia Mon    1    0
    # 6  julia Tue    0    1
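The tidyr version above relies on each Owner / Day / group combination occurring only once, so it may not work as-is on the real data, where an owner can have several locations per day. One possible variant, sketched here under the assumption of tidyr 1.1.0 or later, counts first and then reshapes with `pivot_wider()`. The `visits` data and the `Group` column name are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: bob has two "Work" locations on Monday
visits <- data.frame(
  Owner = c("bob", "bob", "bob", "julia"),
  Day   = c("Mon", "Mon", "Mon", "Tue"),
  Locn  = c("house", "office", "shop", "house")
)

wide <- visits %>%
  mutate(Group = ifelse(Locn %in% c("house", "apartment"), "Home", "Work")) %>%
  count(Owner, Day, Group) %>%   # one row per Owner/Day/Group, count in column n
  pivot_wider(names_from = Group, values_from = n, values_fill = 0L)

wide
# bob / Mon has Home = 1 and Work = 2; julia / Tue has Home = 1 and Work = 0
```

A group that never occurs anywhere in the data would not get a column at all with this approach, so it is worth checking that both Home and Work are present in the result.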

You can use model.matrix from base R

    data[c('Work', 'Home')] <- model.matrix(~0+indx,
        transform(data, indx = Locn %in% c('house', 'apartment')))
    data
    #   Owner Day      Locn Work Home
    #1    bob Mon     house    0    1
    #2  julia Tue     store    1    0
    #3 cheryl Mon apartment    0    1
    #4    bob Tue    office    1    0
    #5  julia Mon     house    0    1
    #6 cheryl Tue      shop    1    0

Or

    library(qdapTools)
    data[c('Work', 'Home')] <- mtabulate(data$Locn %in% c('house', 'apartment'))

This is similar to @lukeA's proposed solution, but using the grepl function:

    library(dplyr)
    library(magrittr)  # provides the %<>% assignment pipe, which dplyr does not export
    data %<>%
      arrange(Owner, Day) %>%
      group_by(Owner, Day) %>%
      summarise(Home = sum((grepl("house|apartment", Locn))*1),
                Work = sum((grepl("store|office|shop", Locn))*1))
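One thing to keep in mind with `grepl` is that an unanchored pattern matches substrings, whereas `%in%` compares whole values. The sketch below uses made-up location names ("warehouse", "workshop") that are not in the original data, just to show the difference.

```r
# Hypothetical location names, chosen to contain "house" and "shop" as substrings
locations <- c("house", "warehouse", "shop", "workshop")

partial  <- grepl("house|apartment", locations)      # TRUE  TRUE  FALSE FALSE
exact    <- locations %in% c("house", "apartment")   # TRUE  FALSE FALSE FALSE
anchored <- grepl("^(house|apartment)$", locations)  # TRUE  FALSE FALSE FALSE
```

If the real Locn values can contain one another as substrings, anchoring the pattern with `^` and `$` or using `%in%` keeps the counts from being inflated.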