Exclude groups that do not have a single consistent value

I have a data frame as shown below:

 sample <- data.frame(
   ID    = 1:9,
   Group = c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
   Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
 )

Each group should have a single, consistent Value.

  ID Group Value
   1    AA     1
   2    AA     1
   3    AA     1
   4    BB     2
   5    BB     2
   6    CC     2
   7    CC     3
   8    BB     2
   9    CC     3

If you look at the CC group, its Value is not consistent: it takes both 2 and 3.

I need to drop the groups that do not have a unique Value.

In the above case, the CC group must be deleted. The result should look like this:

  ID Group Value
   1    AA     1
   2    AA     1
   3    AA     1
   4    BB     2
   5    BB     2
   8    BB     2

Could you suggest simple, fast R code that solves this problem?

4 answers

You can build a logical selector for sample using ave in many ways.

 sample[ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1, ]

or

 sample[ave(sample$Value, sample$Group, FUN = function(x) sum(x - x[1])) == 0, ]

or

 sample[ave(sample$Value, sample$Group, FUN = function(x) diff(range(x))) == 0, ]

Here is a solution using dplyr:

 library(dplyr)

 sample <- data.frame(
   ID    = 1:9,
   Group = c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
   Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
 )

 sample %>%
   group_by(Group) %>%
   filter(n_distinct(Value) == 1)

We group the data by Group, then keep only the groups where the number of distinct values of Value is 1.


data.table version:

 library(data.table)

 sample <- as.data.table(sample)
 sample[, if (length(unique(Value)) == 1) .SD, by = Group]
 #    Group ID Value
 # 1:    AA  1     1
 # 2:    AA  2     1
 # 3:    AA  3     1
 # 4:    BB  4     2
 # 5:    BB  5     2
 # 6:    BB  8     2

An alternative using ave, if the data are numeric, is to check whether the variance is 0:

 sample[with(sample, ave(Value, Group, FUN = var)) == 0, ]
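One caveat with the variance check: var returns NA for a group containing a single row, and NA == 0 is NA, which would introduce NA rows into the result. A minimal sketch, using a hypothetical data frame with a one-row group named DD:

 df1 <- data.frame(Group = c('AA', 'AA', 'DD'), Value = c(1, 1, 5))

 # The AA group (constant) yields a variance of 0,
 # but the singleton DD group yields NA:
 with(df1, ave(Value, Group, FUN = var))
 # [1]  0  0 NA

 # Subsetting with `== 0` would therefore produce an NA row for DD;
 # wrapping the condition with %in% treats NA as FALSE instead:
 df1[with(df1, ave(Value, Group, FUN = var)) %in% 0, ]

The length(unique(x)) and n_distinct approaches shown above do not have this problem, since a one-row group trivially has one distinct value.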

An alternative solution that can be faster with big data:

 setkey(sample, Group, Value)
 ans <- sample[unique(sample)[, .N, by = Group][N == 1, Group]]

The point is that computing the unique values within each group can take a long time when there are many groups. Instead, we can set a key on the data.table and take the unique rows by key (which is very fast), then count the rows per group and keep only the groups where that count is 1. Finally we perform a join (again very fast). Here is a benchmark on larger data:

 require(data.table)
 set.seed(1L)
 sample <- data.table(
   ID    = 1:1e7,
   Group = sample(rep(paste0("id", 1:1e5), each = 100)),
   Value = sample(2, 1e7, replace = TRUE, prob = c(0.9, 0.1))
 )

 system.time(
   ans1 <- sample[, if (length(unique(Value)) == 1) .SD, by = Group]
 )
 # minimum of three runs
 #   user  system elapsed
 # 14.328   0.066  14.382

 system.time({
   setkey(sample, Group, Value)
   ans2 <- sample[unique(sample)[, .N, by = Group][N == 1, Group]]
 })
 # minimum of three runs
 #  user  system elapsed
 #  5.661   0.219   5.877

 setkey(ans1, Group, ID)
 setkey(ans2, Group, ID)
 identical(ans1, ans2)
 # [1] TRUE
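As an aside (assuming a data.table version that provides it, roughly 1.9.6 or later), uniqueN is a shorthand for length(unique(x)) and makes the grouped version a little more readable:

 sample[, if (uniqueN(Value) == 1L) .SD, by = Group]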

Here is an approach using aggregate:

 > ind <- aggregate(Value ~ Group, FUN = function(x) length(unique(x)) == 1, data = sample)[, 2]
 > sample[sample[, "Group"] %in% levels(sample[, "Group"])[ind], ]
   ID Group Value
 1  1    AA     1
 2  2    AA     1
 3  3    AA     1
 4  4    BB     2
 5  5    BB     2
 8  8    BB     2
