Exclude groups that do not have a single consistent value

I have a data frame as shown below:

 sample <- data.frame(
   ID    = 1:9,
   Group = c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
   Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
 )

Each group should have a single, consistent Value.

  ID Group Value
   1    AA     1
   2    AA     1
   3    AA     1
   4    BB     2
   5    BB     2
   6    CC     2
   7    CC     3
   8    BB     2
   9    CC     3

If you look at the CC group, its Value is not consistent: it takes both 2 and 3.

I need to drop the groups that do not have a unique Value.

In the above case, the CC group must be deleted. The result should look like this:

  ID Group Value
   1    AA     1
   2    AA     1
   3    AA     1
   4    BB     2
   5    BB     2
   8    BB     2

Could you suggest simple, fast R code that solves this problem?

4 answers

You can build a logical selector for sample using ave in many ways.

 sample[ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1, ]

or

 sample[ave(sample$Value, sample$Group, FUN = function(x) sum(x - x[1])) == 0, ]

or

 sample[ave(sample$Value, sample$Group, FUN = function(x) diff(range(x))) == 0, ]

Here is a solution using dplyr:

 library(dplyr)

 sample <- data.frame(
   ID    = 1:9,
   Group = c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
   Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
 )

 sample %>%
   group_by(Group) %>%
   filter(n_distinct(Value) == 1)

We group the data by Group, then keep only the groups where the number of distinct values of Value is 1.


data.table version:

 library(data.table)

 sample <- as.data.table(sample)
 sample[, if (length(unique(Value)) == 1) .SD, by = Group]
 #    Group ID Value
 # 1:    AA  1     1
 # 2:    AA  2     1
 # 3:    AA  3     1
 # 4:    BB  4     2
 # 5:    BB  5     2
 # 6:    BB  8     2

An alternative using ave, if the data are numeric, is to check whether the variance is 0:

 sample[with(sample, ave(Value, Group, FUN = var)) == 0, ]
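One caveat with the variance check: var returns NA for a group containing a single row, and NA == 0 is NA, which would introduce NA rows into the result. A minimal sketch, using a hypothetical data frame with a one-row group named DD:

 df1 <- data.frame(Group = c('AA', 'AA', 'DD'), Value = c(1, 1, 5))

 # The AA group (constant) yields a variance of 0,
 # but the singleton DD group yields NA:
 with(df1, ave(Value, Group, FUN = var))
 # [1]  0  0 NA

 # Subsetting with `== 0` would therefore produce an NA row for DD;
 # wrapping the condition with %in% treats NA as FALSE instead:
 df1[with(df1, ave(Value, Group, FUN = var)) %in% 0, ]

The length(unique(x)) and n_distinct approaches shown above do not have this problem, since a one-row group trivially has one distinct value.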

An alternative solution that can be faster with big data:

 setkey(sample, Group, Value)
 ans <- sample[unique(sample)[, .N, by = Group][N == 1, Group]]

The point is that computing the unique values within each group can take a long time when there are many groups. Instead, we can set a key on the data.table and take the unique rows by key (which is very fast), then count the rows per group and keep only the groups where that count is 1. Finally we perform a join (again very fast). Here is a benchmark on larger data:

 require(data.table)
 set.seed(1L)
 sample <- data.table(
   ID    = 1:1e7,
   Group = sample(rep(paste0("id", 1:1e5), each = 100)),
   Value = sample(2, 1e7, replace = TRUE, prob = c(0.9, 0.1))
 )

 system.time(
   ans1 <- sample[, if (length(unique(Value)) == 1) .SD, by = Group]
 )
 # minimum of three runs
 #   user  system elapsed
 # 14.328   0.066  14.382

 system.time({
   setkey(sample, Group, Value)
   ans2 <- sample[unique(sample)[, .N, by = Group][N == 1, Group]]
 })
 # minimum of three runs
 #  user  system elapsed
 #  5.661   0.219   5.877

 setkey(ans1, Group, ID)
 setkey(ans2, Group, ID)
 identical(ans1, ans2)
 # [1] TRUE
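As an aside (assuming a data.table version that provides it, roughly 1.9.6 or later), uniqueN is a shorthand for length(unique(x)) and makes the grouped version a little more readable:

 sample[, if (uniqueN(Value) == 1L) .SD, by = Group]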

Here is an approach using aggregate:

 > ind <- aggregate(Value ~ Group, FUN = function(x) length(unique(x)) == 1, data = sample)[, 2]
 > sample[sample[, "Group"] %in% levels(sample[, "Group"])[ind], ]
   ID Group Value
 1  1    AA     1
 2  2    AA     1
 3  3    AA     1
 4  4    BB     2
 5  5    BB     2
 8  8    BB     2
