Delete some rows in a row group in R

Question


Suppose I have this dataset:

    Id Name Price sales Profit Month Category Mode Supplier
     1    A     2     0      0     1        X    K     John
     1    A     2     0      0     2        X    K     John
     1    A     2     5      8     3        X    K     John
     1    A     2     5      8     4        X    L      Sam
     2    B     2     3      4     1        X    L      Sam
     2    B     2     0      0     2        X    L      Sam
     2    B     2     0      0     3        X    M     John
     2    B     2     0      0     4        X    L     John
     3    C     2     0      0     1        X    K     John
     3    C     2     8     10     2        Y    M     John
     3    C     2     8     10     3        Y    K     John
     3    C     2     0      0     4        Y    K     John
     5    E     2     0      0     1        Y    M      Sam
     5    E     2     5      5     2        Y    L      Sam
     5    E     2     5      9     3        Y    M      Sam
     5    E     2     0      0     4        Z    M     Kyle
     5    E     2     5      8     5        Z    L     Kyle
     5    E     2     5      8     6        Z    M     Kyle

I want to delete rows that have zeros in both the sales and Profit columns, within each Id group. Specifically, for a given Id, if two or more consecutive rows have zero values for both sales and Profit, those rows should be deleted. The data set would then become:

    Id Name Price sales Profit Month Category Mode Supplier
     1    A     2     5      8     3        X    K     John
     1    A     2     5      8     4        X    L      Sam
     2    B     2     3      4     1        X    L      Sam
     3    C     2     0      0     1        X    K     John
     3    C     2     8     10     2        Y    M     John
     3    C     2     8     10     3        Y    K     John
     3    C     2     0      0     4        Y    K     John
     5    E     2     0      0     1        Y    M      Sam
     5    E     2     5      5     2        Y    L      Sam
     5    E     2     5      9     3        Y    M      Sam
     5    E     2     0      0     4        Z    M     Kyle
     5    E     2     5      8     5        Z    L     Kyle
     5    E     2     5      8     6        Z    M     Kyle

I can delete all rows that have zero values for both sales and Profit with:

    df1 <- df[!(df$sales == 0 & df$Profit == 0), ]

But how can I delete such rows only within each group, in this case grouped by Id?

P.S. The idea is to remove the records for products that only started selling a few months into the year, or that stopped selling after a few months.

+6

r

Jay khan Dec 15 '15 at 16:17


4 answers

Here's how to do it with dplyr. Basically, I keep only the rows where sales is non-zero, or where the previous and next rows within the same Id are non-zero.

    library(dplyr)

    table1 %>%
      group_by(Id) %>%
      mutate(Lag = lag(sales), Lead = lead(sales)) %>%
      rowwise() %>%
      mutate(Min = min(Lag, Lead, na.rm = TRUE)) %>%
      filter(sales > 0 | Min > 0) %>%
      select(-Lead, -Lag, -Min)

          Id  Name Price sales Profit Month Category  Mode Supplier
       (int) (chr) (int) (int)  (int) (int)    (chr) (chr)    (chr)
    1      1     A     2     5      8     3        X     K     John
    2      1     A     2     5      8     4        X     L      Sam
    3      2     B     2     3      4     1        X     L      Sam
    4      3     C     2     0      0     1        X     K     John
    5      3     C     2     8     10     2        Y     M     John
    6      3     C     2     8     10     3        Y     K     John
    7      3     C     2     0      0     4        Y     K     John
    8      5     E     2     0      0     1        Y     M      Sam
    9      5     E     2     5      5     2        Y     L      Sam
    10     5     E     2     5      9     3        Y     M      Sam
    11     5     E     2     0      0     4        Z     M     Kyle
    12     5     E     2     5      8     5        Z     L     Kyle
    13     5     E     2     5      8     6        Z     M     Kyle
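To see why this filter keeps isolated zeros but drops runs of them, it may help to look at a single group on its own. The sketch below uses a made-up sales vector (not taken from the question) and assumes dplyr is installed; it swaps the rowwise() and min() step for the vectorised pmin(), which should give the same comparison for rows that have at least one neighbour.

```r
library(dplyr)

# Hypothetical sales values for a single Id, chosen only for illustration
sales <- c(0, 0, 5, 0, 7)

prev_sales <- dplyr::lag(sales)   # value of the previous row, NA for the first row
next_sales <- dplyr::lead(sales)  # value of the next row, NA for the last row

# Smallest neighbouring value, ignoring the NA at either end of the group
min_neighbour <- pmin(prev_sales, next_sales, na.rm = TRUE)

# Keep a row if it has sales, or if every existing neighbour has sales
keep <- sales > 0 | min_neighbour > 0
sales[keep]  # 5 0 7
```

The two leading zeros are dropped because each has a zero neighbour, while the zero between 5 and 7 is kept.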

Data

    table1 <- read.table(text = "
    Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
    1,A,2,0,0,1,X,K,John
    1,A,2,0,0,2,X,K,John
    1,A,2,5,8,3,X,K,John
    1,A,2,5,8,4,X,L,Sam
    2,B,2,3,4,1,X,L,Sam
    2,B,2,0,0,2,X,L,Sam
    2,B,2,0,0,3,X,M,John
    2,B,2,0,0,4,X,L,John
    3,C,2,0,0,1,X,K,John
    3,C,2,8,10,2,Y,M,John
    3,C,2,8,10,3,Y,K,John
    3,C,2,0,0,4,Y,K,John
    5,E,2,0,0,1,Y,M,Sam
    5,E,2,5,5,2,Y,L,Sam
    5,E,2,5,9,3,Y,M,Sam
    5,E,2,0,0,4,Z,M,Kyle
    5,E,2,5,8,5,Z,L,Kyle
    5,E,2,5,8,6,Z,M,Kyle
    ", sep = ",", stringsAsFactors = FALSE, header = TRUE)

UPDATE: To apply these criteria to more than one column, here's how to do it. In this case the result is the same, because whenever sales is 0, Profit is also 0.

    library(dplyr)

    table1 %>%
      group_by(Id) %>%
      mutate(LagS = lag(sales), LeadS = lead(sales),
             LagP = lag(Profit), LeadP = lead(Profit)) %>%
      rowwise() %>%
      mutate(MinS = min(LagS, LeadS, na.rm = TRUE),
             MinP = min(LagP, LeadP, na.rm = TRUE)) %>%
      filter(sales > 0 | MinS > 0 | Profit > 0 | MinP > 0) %>%  # "|" means OR
      select(-LeadS, -LagS, -MinS, -LeadP, -LagP, -MinP)

+3

Pierre lapointe Dec 15 '15 at 16:42


I couldn't do it in one line, but here it is in three:

    x <- df$sales == 0 & df$Profit == 0
    y <- cumsum(c(1, head(x, -1) != tail(x, -1)))
    df[ave(x, df$Id, y, FUN = sum) < 2, ]
    #    Id Name Price sales Profit Month Category Mode Supplier
    # 3   1    A     2     5      8     3        X    K     John
    # 4   1    A     2     5      8     4        X    L      Sam
    # 5   2    B     2     3      4     1        X    L      Sam
    # 9   3    C     2     0      0     1        X    K     John
    # 10  3    C     2     8     10     2        Y    M     John
    # 11  3    C     2     8     10     3        Y    K     John
    # 12  3    C     2     0      0     4        Y    K     John
    # 13  5    E     2     0      0     1        Y    M      Sam
    # 14  5    E     2     5      5     2        Y    L      Sam
    # 15  5    E     2     5      9     3        Y    M      Sam
    # 16  5    E     2     0      0     4        Z    M     Kyle
    # 17  5    E     2     5      8     5        Z    L     Kyle
    # 18  5    E     2     5      8     6        Z    M     Kyle

This works by first identifying all rows where both sales and Profit are 0 (x). The variable y then labels each run of consecutive TRUE or FALSE values in x. The ave() function splits its first argument (x) into groups defined by the remaining arguments (df$Id and y) and applies the function within each group. Since the function is sum(), it counts the TRUE values in each group and returns a vector of the same length as x, so we only need to keep the rows where the result is less than 2.
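The intermediate vectors are easier to follow on a toy input. The following is a minimal sketch in base R using hypothetical id and x vectors (they are not from the question); x is converted to integer before calling ave() only to keep the result type explicit.

```r
# Hypothetical data: two Ids, TRUE marks a row where sales and Profit are both zero
id <- c(1, 1, 1, 2, 2, 2, 2)
x  <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

# A new run starts wherever x differs from the element before it
y <- cumsum(c(1, head(x, -1) != tail(x, -1)))
y          # 1 1 2 2 3 4 4

# Count of TRUE values within each (id, run) cell, repeated for every row of the cell
run_zeros <- ave(as.integer(x), id, y, FUN = sum)
run_zeros  # 0 0 1 1 0 2 2

# Rows in a zero run of length 2 or more are dropped
keep <- run_zeros < 2
keep       # TRUE TRUE TRUE TRUE TRUE FALSE FALSE
```

Rows 3 and 4 are adjacent zeros but belong to different Ids, so grouping by id as well as y splits them into two runs of length 1 and both are kept. Only the last two rows, which form a zero run inside Id 2, are removed.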

+1

Sam dickson Dec 15 '15 at 16:37


Here is my solution:

    aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle),
                  function(x) with(x, cbind(rep(values, lengths), rep(lengths, lengths))))
    df[!(do.call(rbind, aux)[, 1] == 0 & do.call(rbind, aux)[, 2] >= 2), ]

       Id Name Price sales Profit Month Category Mode Supplier
    3   1    A     2     5      8     3        X    K     John
    4   1    A     2     5      8     4        X    L      Sam
    5   2    B     2     3      4     1        X    L      Sam
    9   3    C     2     0      0     1        X    K     John
    10  3    C     2     8     10     2        Y    M     John
    11  3    C     2     8     10     3        Y    K     John
    12  3    C     2     0      0     4        Y    K     John
    13  5    E     2     0      0     1        Y    M      Sam
    14  5    E     2     5      5     2        Y    L      Sam
    15  5    E     2     5      9     3        Y    M      Sam
    16  5    E     2     0      0     4        Z    M     Kyle
    17  5    E     2     5      8     5        Z    L     Kyle
    18  5    E     2     5      8     6        Z    M     Kyle

+1

Datamine r Dec 15 '15 at 17:03


A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2015-12-15T16:44:00+0000

Here's an approach using rleid from "data.table":

    library(data.table)
    as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
      !(sales == 0 & Profit == 0 & N >= 2)]
    ##     Id Name Price sales Profit Month Category Mode Supplier N
    ##  1:  1    A     2     5      8     3        X    K     John 2
    ##  2:  1    A     2     5      8     4        X    L      Sam 2
    ##  3:  2    B     2     3      4     1        X    L      Sam 1
    ##  4:  3    C     2     0      0     1        X    K     John 1
    ##  5:  3    C     2     8     10     2        Y    M     John 2
    ##  6:  3    C     2     8     10     3        Y    K     John 2
    ##  7:  3    C     2     0      0     4        Y    K     John 1
    ##  8:  5    E     2     0      0     1        Y    M      Sam 1
    ##  9:  5    E     2     5      5     2        Y    L      Sam 2
    ## 10:  5    E     2     5      9     3        Y    M      Sam 2
    ## 11:  5    E     2     0      0     4        Z    M     Kyle 1
    ## 12:  5    E     2     5      8     5        Z    L     Kyle 2
    ## 13:  5    E     2     5      8     6        Z    M     Kyle 2
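For readers unfamiliar with rleid, here is a small sketch of what it returns, assuming data.table is installed. The zero vector is hypothetical and stands in for the condition sales == 0 & Profit == 0; the base R lines are only a comparison, not part of the answer above.

```r
library(data.table)

# Hypothetical flags: TRUE marks a row where sales and Profit are both zero
zero <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# rleid() gives each run of identical consecutive values its own integer id
run_id <- rleid(zero)
run_id  # 1 1 2 2 3 4

# The same ids can be built in base R from rle()
r <- rle(zero)
base_id <- rep(seq_along(r$lengths), r$lengths)
identical(run_id, base_id)  # TRUE
```

Grouping by Id together with this run id is what lets .N count the length of each run of zero rows within an Id, so runs of length 2 or more can be filtered out.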
