R: How to sum values ​​from strings only if the key value is the same, and also if the key is duplicated in consecutive lines?

I have the following data table for this structure:

+-------------------+ | id | key | value | +-----+-----+-------+ | 1 | A | 1000 | | 1 | A | 2000 | | 1 | B | 2001 | | 1 | A | 2002 | | 1 | A | 2004 | | 2 | B | 2002 | | 2 | C | 2002 | +-------------------+ 

My goal is to sum the values ​​by id and key, but instead of just grouping by id and key, I only want to sum the values ​​if the pair id and key are the same for consecutive lines.

The result should be:

 +-------------------+ | id | key | value | +-----+-----+-------+ | 1 | A | 3000 | | 1 | B | 2001 | | 1 | A | 4006 | | 2 | B | 2002 | | 2 | C | 2002 | +-------------------+ 

Anyway to achieve this result?

+5
r data.table
source share
1 answer

We can use rleid from devel version data.table ie v1.9.5 +. Installation instructions for the devel: here version.

We will convert 'data.frame' to 'data.table'. Create another grouping column "ind" from the "key" column. Grouped by "id" and "ind", we get the sum of "value" and get the first element "key". We can set 'ind' to NULL because it is not needed in the expected output.

 library(data.table) setDT(df1)[,list(value = sum(value), key=key[1L]), by = .(ind=rleid(key), id)][, ind:=NULL][] # id value key #1: 1 3000 A #2: 1 2001 B #3: 1 4006 A #4: 2 2002 B #5: 2 2002 C 

Or, as @Frank suggested, we can use mutliple columns in rleid , use it as a grouping variable, get the first element of other variables and sum from the value, set the unwanted column to NULL or use the standard subset of data.table using with = FALSE .

 setDT(df1)[, list(id=id[1L], key=key[1L], value=sum(value)) , by = .(r=rleid(id, key))][, r:= NULL][] # id key value #1: 1 A 3000 #2: 1 B 2001 #3: 1 A 4006 #4: 2 B 2002 #5: 2 C 2002 

Or we can use dplyr . We create the grouping variable 'ind' by comparing adjacent elements of 'key' and get the sum from the value 'and first element' key 'with summarise .

 library(dplyr) df1 %>% group_by(ind= cumsum(key!=lag(key, default=TRUE)), id) %>% summarise(value=sum(value), key=first(key)) %>% ungroup() %>% select(-ind) # id value key #1 1 3000 A #2 1 2001 B #3 1 4006 A #4 2 2002 B #5 2 2002 C 

NOTE. In dplyr and data.table we can also put the key column as a grouping variable and delete key=key[1L] or key=first(key)) .


Or we transform data set by creating the "ind" column and use aggregate from base R to get the expected output

 df1 <- transform(df1, ind = cumsum(c(TRUE,head(key,-1)!=tail(key,-1)))) aggregate(value~., df1, FUN=sum)[-3] # id key value #1 1 A 3000 #2 1 B 2001 #3 1 A 4006 #4 2 B 2002 #5 2 C 2002 

data

 df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), key = c("A", "A", "B", "A", "A", "B", "C"), value = c(1000L, 2000L, 2001L, 2002L, 2004L, 2002L, 2002L)), .Names = c("id", "key", "value" ), class = "data.frame", row.names = c(NA, -7L)) 
+8
source share

All Articles