Subsetting on several conditions

Perhaps this is something basic, but I could not find the answer.
I have

    Id Year V1
     1 2009 33
     1 2010 67
     1 2011 38
     2 2009 45
     3 2009 65
     3 2010 74
     4 2009 47
     4 2010 51
     4 2011 14

I need to keep only the rows for identifiers that appear in all three years: 2009, 2010 and 2011.

    Id Year V1
     1 2009 33
     1 2010 67
     1 2011 38
     4 2009 47
     4 2010 51
     4 2011 14

I tried

    d1_3 <- subset(d1, Year == 2009 | Year == 2010 | Year == 2011)

but that does not work: it keeps every row whose year is one of the three, so Ids 2 and 3 are still in the result.

Can someone give some suggestions on how I can do this in R?

4 answers

I think `ave` might be useful here. I call your original data frame `df`. For each identifier, check whether each of the years 2009 to 2011 is present among its years (`2009:2011 %in% x`). This gives a logical vector, which can be summed. Then check whether the sum is 3 (if all three years are present, the sum is 3). This gives one value per row, which is used as a logical vector to subset the rows of the data frame.

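    # ave() returns values of the same type as df$Year (1/0 here),
    # so convert them back to logical before indexing
    df[as.logical(ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3)), ]
    #   Id Year V1
    # 1  1 2009 33
    # 2  1 2010 67
    # 3  1 2011 38
    # 7  4 2009 47
    # 8  4 2010 51
    # 9  4 2011 14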
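To make the intermediate step visible, here is a small self-contained sketch that rebuilds the sample data from the question and prints what `ave` returns before it is used for subsetting. The names `has_all` and `res` are only illustrative.

```r
# Sample data from the question
df <- data.frame(
  Id   = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
  Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
  V1   = c(33, 67, 38, 45, 65, 74, 47, 51, 14)
)

# One check per Id, repeated for every row of that Id.
# The result has the same type as df$Year, so TRUE/FALSE come back as 1/0.
has_all <- ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3)
print(has_all)
# [1] 1 1 1 0 0 0 1 1 1

# Keep only the rows whose Id passed the check
res <- df[has_all == 1, ]
print(res)
```

Because the result is numeric, using it directly as a row index would select row 1 repeatedly instead of filtering, which is why the comparison (or `as.logical`) is needed.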

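This should do the job :)

    library(plyr)
    # Nobs = number of rows per Id. This assumes each Id has at most one row
    # per year and no rows outside 2009-2011, as in the sample data.
    ds <- ddply(ds, .(Id), mutate, Nobs = length(Year))
    ds[ds$Nobs == 3 & ds$Year %in% 2009:2011, ]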

Another way to use ave

    DF
    ##   Id Year V1
    ## 1  1 2009 33
    ## 2  1 2010 67
    ## 3  1 2011 38
    ## 4  2 2009 45
    ## 5  3 2009 65
    ## 6  3 2010 74
    ## 7  4 2009 47
    ## 8  4 2010 51
    ## 9  4 2011 14

    DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
    ##   Id Year V1
    ## 1  1 2009 33
    ## 2  1 2010 67
    ## 3  1 2011 38
    ## 7  4 2009 47
    ## 8  4 2010 51
    ## 9  4 2011 14

I think the approach using `ave` is reasonable, but there are many ways to solve this problem. I will show a few other ways using base R. Then, in the last two examples, I will introduce the data.table package.

Again, I am just throwing these out there to illustrate different aspects of the language.

    d1 <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                     Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                     V1 = c(33, 67, 38, 45, 65, 74, 47, 51, 14))

    # long way
    use_years <- as.character(2009:2011)
    cnts <- table(d1[, c("ID", "Year")])[, use_years]
    use_id <- rownames(cnts)[rowSums(cnts) == length(use_years)]
    d1[d1[, "ID"] %in% use_id, ]
    #   ID Year V1
    # 1  1 2009 33
    # 2  1 2010 67
    # 3  1 2011 38
    # 7  4 2009 47
    # 8  4 2010 51
    # 9  4 2011 14

    # another longish way
    ind1 <- d1[, "Year"] %in% 2009:2011
    d1_ind <- d1[ind1, "ID"]
    ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind) == 3]
    d1[ind1, ][ind2, ]
    #   ID Year V1
    # 1  1 2009 33
    # 2  1 2010 67
    # 3  1 2011 38
    # 7  4 2009 47
    # 8  4 2010 51
    # 9  4 2011 14
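The second approach above pairs `unique(d1_ind)` with `tabulate(d1_ind)`, which lines up only when the IDs are the integers 1, 2, 3, ... and first appear in increasing order, as they do in the sample data. As a hedged alternative, here is a base R sketch with `tapply` that does not depend on that; the names `ok` and `res2` are only illustrative.

```r
# Sample data, as in the answer above
d1 <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                 Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                 V1 = c(33, 67, 38, 45, 65, 74, 47, 51, 14))

# One TRUE/FALSE per ID: does this ID have all three years?
ok <- tapply(d1$Year, d1$ID, function(y) all(2009:2011 %in% y))

# names(ok) holds the IDs as character strings; %in% compares them
# with the numeric ID column after coercion
res2 <- d1[d1$ID %in% names(ok)[ok], ]
print(res2)
```

Note that this keeps all rows of a qualifying ID, including any outside 2009 to 2011; add `& d1$Year %in% 2009:2011` to the row condition if those should be dropped.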

OK, now a few methods using data.table, one of my favorite packages of all time. It can be a little tricky at first, so make sure your shoelaces are tied (oh yes, it is fast!) :)

    # medium way
    library(data.table)
    d2 <- as.data.table(d1)
    d2[ID %in% d2[Year %in% 2009:2011, list(logic = nrow(.SD) == 3), by = "ID"][(logic), ID]]
    #    ID Year V1
    # 1:  1 2009 33
    # 2:  1 2010 67
    # 3:  1 2011 38
    # 4:  4 2009 47
    # 5:  4 2010 51
    # 6:  4 2011 14

    # short way
    d2[Year %in% 2009:2011][ID %in% unique(ID)[table(ID) == 3]]
    #    ID Year V1
    # 1:  1 2009 33
    # 2:  1 2010 67
    # 3:  1 2011 38
    # 4:  4 2009 47
    # 5:  4 2010 51
    # 6:  4 2011 14
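For comparison, here is a sketch of another common data.table idiom for this kind of grouped filter: return `.SD` only for groups that pass the test. It assumes the data.table package is installed; `wanted` and `res_dt` are illustrative names, not part of the answers above.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
d2 <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                 Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                 V1 = c(33, 67, 38, 45, 65, 74, 47, 51, 14))

wanted <- 2009:2011

# For each ID, keep its rows (.SD) only if all wanted years are present.
# Groups that fail the test return NULL and drop out of the result.
res_dt <- d2[, if (all(wanted %in% Year)) .SD, by = ID]
print(res_dt)
```

Unlike the counting approaches, this tests for the years themselves, so it does not depend on each ID having exactly one row per year.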
