Delete some rows in a row group in R

Suppose I have this dataset

    Id  Name  Price  sales  Profit  Month  Category  Mode  Supplier
    1   A     2      0      0       1      X         K     John
    1   A     2      0      0       2      X         K     John
    1   A     2      5      8       3      X         K     John
    1   A     2      5      8       4      X         L     Sam
    2   B     2      3      4       1      X         L     Sam
    2   B     2      0      0       2      X         L     Sam
    2   B     2      0      0       3      X         M     John
    2   B     2      0      0       4      X         L     John
    3   C     2      0      0       1      X         K     John
    3   C     2      8      10      2      Y         M     John
    3   C     2      8      10      3      Y         K     John
    3   C     2      0      0       4      Y         K     John
    5   E     2      0      0       1      Y         M     Sam
    5   E     2      5      5       2      Y         L     Sam
    5   E     2      5      9       3      Y         M     Sam
    5   E     2      0      0       4      Z         M     Kyle
    5   E     2      5      8       5      Z         L     Kyle
    5   E     2      5      8       6      Z         M     Kyle

I want to delete rows that have zeros in the sales and Profit columns, within each Id group. Specifically, for a given Id, if two or more consecutive rows have zero values for both sales and Profit, those rows should be deleted. The dataset would then become:

    Id  Name  Price  sales  Profit  Month  Category  Mode  Supplier
    1   A     2      5      8       3      X         K     John
    1   A     2      5      8       4      X         L     Sam
    2   B     2      3      4       1      X         L     Sam
    3   C     2      0      0       1      X         K     John
    3   C     2      8      10      2      Y         M     John
    3   C     2      8      10      3      Y         K     John
    3   C     2      0      0       4      Y         K     John
    5   E     2      0      0       1      Y         M     Sam
    5   E     2      5      5       2      Y         L     Sam
    5   E     2      5      9       3      Y         M     Sam
    5   E     2      0      0       4      Z         M     Kyle
    5   E     2      5      8       5      Z         L     Kyle
    5   E     2      5      8       6      Z         M     Kyle

I can delete all rows that have zero values for both sales and Profit with

    df1 <- df[!(df$sales == 0 & df$Profit == 0), ]

But how can I delete only the consecutive zero rows within each Id group?

P.S. The idea is to drop the entries for products that only started selling a few months into the year, or that stopped selling after a few months.

4 answers

Here's an approach using rleid from "data.table":

    library(data.table)
    as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
      !(sales == 0 & Profit == 0 & N >= 2)]
    ##     Id Name Price sales Profit Month Category Mode Supplier N
    ##  1:  1    A     2     5      8     3        X    K     John 2
    ##  2:  1    A     2     5      8     4        X    L      Sam 2
    ##  3:  2    B     2     3      4     1        X    L      Sam 1
    ##  4:  3    C     2     0      0     1        X    K     John 1
    ##  5:  3    C     2     8     10     2        Y    M     John 2
    ##  6:  3    C     2     8     10     3        Y    K     John 2
    ##  7:  3    C     2     0      0     4        Y    K     John 1
    ##  8:  5    E     2     0      0     1        Y    M      Sam 1
    ##  9:  5    E     2     5      5     2        Y    L      Sam 2
    ## 10:  5    E     2     5      9     3        Y    M      Sam 2
    ## 11:  5    E     2     0      0     4        Z    M     Kyle 1
    ## 12:  5    E     2     5      8     5        Z    L     Kyle 2
    ## 13:  5    E     2     5      8     6        Z    M     Kyle 2
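For readers without data.table, the run-length idea behind `rleid` can be sketched in base R. This is only an illustration on a made-up flag vector for a single Id (`zero`, `run_id` and `run_len` are hypothetical names, not from the answer); `run_id` and `run_len` play roughly the role of `rleid(...)` and `N` above.

```r
# Hypothetical flag vector: TRUE where sales == 0 & Profit == 0, for one Id
zero <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

r <- rle(zero)                                    # runs of identical values
run_id  <- rep(seq_along(r$lengths), r$lengths)   # 1 1 2 2 3
run_len <- rep(r$lengths, r$lengths)              # 2 2 2 2 1

# Drop rows that are zero AND belong to a run of length >= 2
keep <- !(zero & run_len >= 2)
keep  # FALSE FALSE TRUE TRUE TRUE
```

The trailing single zero is kept because its run has length 1, which matches the "two or more consecutive rows" rule in the question.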

Here's how to do it with dplyr. Basically, I keep only rows that are non-zero, or whose previous and next rows (within the same Id) are both non-zero.

    library(dplyr)
    table1 %>%
      group_by(Id) %>%
      mutate(Lag = lag(sales), Lead = lead(sales)) %>%   # neighbours within each Id
      rowwise() %>%
      mutate(Min = min(Lag, Lead, na.rm = TRUE)) %>%     # smallest existing neighbour
      filter(sales > 0 | Min > 0) %>%
      select(-Lead, -Lag, -Min)

          Id  Name Price sales Profit Month Category  Mode Supplier
       (int) (chr) (int) (int)  (int) (int)    (chr) (chr)    (chr)
    1      1     A     2     5      8     3        X     K     John
    2      1     A     2     5      8     4        X     L      Sam
    3      2     B     2     3      4     1        X     L      Sam
    4      3     C     2     0      0     1        X     K     John
    5      3     C     2     8     10     2        Y     M     John
    6      3     C     2     8     10     3        Y     K     John
    7      3     C     2     0      0     4        Y     K     John
    8      5     E     2     0      0     1        Y     M      Sam
    9      5     E     2     5      5     2        Y     L      Sam
    10     5     E     2     5      9     3        Y     M      Sam
    11     5     E     2     0      0     4        Z     M     Kyle
    12     5     E     2     5      8     5        Z     L     Kyle
    13     5     E     2     5      8     6        Z     M     Kyle
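The neighbour rule can be traced by hand on a single group. The sketch below is an illustration in base R, not the answer's code: the `sales` vector is made up, the shifted vectors stand in for what `lag()` and `lead()` would return, and `pmin()` is swapped in for the `rowwise()` plus `min()` step. Like the answer, it assumes sales are never negative.

```r
# Hypothetical sales for a single Id, in Month order
sales <- c(0, 0, 5, 0, 5)

prev_sales <- c(NA, head(sales, -1))   # previous row's value (NA for the first row)
next_sales <- c(tail(sales, -1), NA)   # next row's value (NA for the last row)

# Smallest existing neighbour per row, ignoring the NA at either end
nbr_min <- pmin(prev_sales, next_sales, na.rm = TRUE)  # 0 0 0 5 0

# Keep non-zero rows, plus zero rows whose neighbours are all non-zero
keep <- sales > 0 | nbr_min > 0
keep  # FALSE FALSE TRUE TRUE TRUE
```

The first two zeros are dropped because each has a zero neighbour, while the isolated zero in the fourth position survives.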

Data

    table1 <- read.table(text = "
    Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
    1,A,2,0,0,1,X,K,John
    1,A,2,0,0,2,X,K,John
    1,A,2,5,8,3,X,K,John
    1,A,2,5,8,4,X,L,Sam
    2,B,2,3,4,1,X,L,Sam
    2,B,2,0,0,2,X,L,Sam
    2,B,2,0,0,3,X,M,John
    2,B,2,0,0,4,X,L,John
    3,C,2,0,0,1,X,K,John
    3,C,2,8,10,2,Y,M,John
    3,C,2,8,10,3,Y,K,John
    3,C,2,0,0,4,Y,K,John
    5,E,2,0,0,1,Y,M,Sam
    5,E,2,5,5,2,Y,L,Sam
    5,E,2,5,9,3,Y,M,Sam
    5,E,2,0,0,4,Z,M,Kyle
    5,E,2,5,8,5,Z,L,Kyle
    5,E,2,5,8,6,Z,M,Kyle
    ", sep = ",", stringsAsFactors = FALSE, header = TRUE)

UPDATE: To apply these criteria to more than one column, here's how to do it. In this case the result is the same, because whenever sales is 0, Profit is also 0.

    library(dplyr)
    table1 %>%
      group_by(Id) %>%
      mutate(LagS = lag(sales), LeadS = lead(sales),
             LagP = lag(Profit), LeadP = lead(Profit)) %>%
      rowwise() %>%
      mutate(MinS = min(LagS, LeadS, na.rm = TRUE),
             MinP = min(LagP, LeadP, na.rm = TRUE)) %>%
      filter(sales > 0 | MinS > 0 | Profit > 0 | MinP > 0) %>%  # "|" means OR
      select(-LeadS, -LagS, -MinS, -LeadP, -LagP, -MinP)

I can't do this in one line, but here it is in three:

    x <- df$sales == 0 & df$Profit == 0
    y <- cumsum(c(1, head(x, -1) != tail(x, -1)))
    df[ave(x, df$Id, y, FUN = sum) < 2, ]
    #    Id Name Price sales Profit Month Category Mode Supplier
    # 3   1    A     2     5      8     3        X    K     John
    # 4   1    A     2     5      8     4        X    L      Sam
    # 5   2    B     2     3      4     1        X    L      Sam
    # 9   3    C     2     0      0     1        X    K     John
    # 10  3    C     2     8     10     2        Y    M     John
    # 11  3    C     2     8     10     3        Y    K     John
    # 12  3    C     2     0      0     4        Y    K     John
    # 13  5    E     2     0      0     1        Y    M      Sam
    # 14  5    E     2     5      5     2        Y    L      Sam
    # 15  5    E     2     5      9     3        Y    M      Sam
    # 16  5    E     2     0      0     4        Z    M     Kyle
    # 17  5    E     2     5      8     5        Z    L     Kyle
    # 18  5    E     2     5      8     6        Z    M     Kyle

This works by first flagging all rows where both sales and Profit are 0 (x). The variable y then labels consecutive runs of TRUE and FALSE values. The ave() function splits its first argument (x) by the grouping variables that follow (df$Id and y) and applies the function within each group. Because the function is sum(), it counts the TRUE values in each group and returns a vector of the same length as x, so we only need to keep the rows where the result is less than 2.
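To make the intermediate values concrete, here is a small trace of the same three steps on made-up toy vectors (`id`, `sales`, `profit` and `n_zero` are hypothetical names for this illustration only). Wrapping `x` in `as.integer()` is an extra precaution in this sketch so the counts are unambiguously numeric.

```r
# Hypothetical toy data: two Ids, four rows each
id     <- c(1, 1, 1, 1, 2, 2, 2, 2)
sales  <- c(0, 0, 5, 5, 3, 0, 0, 0)
profit <- c(0, 0, 8, 8, 4, 0, 0, 0)

x <- sales == 0 & profit == 0                   # T T F F F T T T
y <- cumsum(c(1, head(x, -1) != tail(x, -1)))   # 1 1 2 2 2 3 3 3

# Number of zero rows in each (id, run) combination, repeated for every row
n_zero <- ave(as.integer(x), id, y, FUN = sum)  # 2 2 0 0 0 3 3 3

which(n_zero < 2)  # 3 4 5
```

Run 2 in `y` spans the last rows of Id 1 and the first row of Id 2, which is why the Id is passed to `ave()` alongside `y`: runs are then counted separately within each Id.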


Here is my solution:

    aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle),
                  function(x) with(x, cbind(rep(values, lengths), rep(lengths, lengths))))
    df[!(do.call(rbind, aux)[, 1] == 0 & do.call(rbind, aux)[, 2] >= 2), ]

       Id Name Price sales Profit Month Category Mode Supplier
    3   1    A     2     5      8     3        X    K     John
    4   1    A     2     5      8     4        X    L      Sam
    5   2    B     2     3      4     1        X    L      Sam
    9   3    C     2     0      0     1        X    K     John
    10  3    C     2     8     10     2        Y    M     John
    11  3    C     2     8     10     3        Y    K     John
    12  3    C     2     0      0     4        Y    K     John
    13  5    E     2     0      0     1        Y    M      Sam
    14  5    E     2     5      5     2        Y    L      Sam
    15  5    E     2     5      9     3        Y    M      Sam
    16  5    E     2     0      0     4        Z    M     Kyle
    17  5    E     2     5      8     5        Z    L     Kyle
    18  5    E     2     5      8     6        Z    M     Kyle