Why does dplyr::distinct behave this way for grouped data frames?

My question involves the distinct function from dplyr.

First set up the data:

 library(dplyr)

 set.seed(0)
 df <- data.frame(
   x = sample(10, 100, replace = TRUE),
   y = sample(10, 100, replace = TRUE)
 )

Consider the following two uses of distinct.

 df %>% group_by(x) %>% distinct()
 df %>% group_by(x) %>% distinct(y)

The first produces a different result from the second. As far as I can tell, the first finds "all distinct values of x, returning the first value of y for each", whereas the second finds "for each value of x, all distinct values of y".

Why should this be so when

 df %>% distinct(x, y)
 df %>% distinct()

give the same result?

EDIT: It looks like this is already a known bug: https://github.com/hadley/dplyr/issues/1110

1 answer

As far as I can tell, the answer is that distinct takes the grouping columns into account when determining distinctness, which seems to me inconsistent with how the rest of dplyr works.

Thus:

 df %>% group_by(x) %>% distinct() 

This groups by x and finds the rows that are distinct in x (!). This seems to be a bug.

But:

 df %>% group_by(x) %>% distinct(y) 

This groups by x and finds the rows that are distinct in y given x. It is equivalent to either of these:

 df %>% distinct(x, y)
 df %>% distinct()

Both find the rows that are distinct in x and y.
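To make the two meanings concrete without depending on any particular dplyr version, here is a minimal base R sketch. It is only an illustration of the two notions of distinctness described above, not of what distinct does internally; the names first_per_x and distinct_xy are made up for this example.

```r
set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)

# "Distinct in x": keep only the first row seen for each value of x.
first_per_x <- df[!duplicated(df$x), ]

# "Distinct in y given x": keep the first row seen for each (x, y) pair.
distinct_xy <- df[!duplicated(df[c("x", "y")]), ]

nrow(first_per_x)  # one row per distinct value of x
nrow(distinct_xy)  # one row per distinct (x, y) pair
```

The first result can never have more rows than the second, which is why the two grouped calls in the question can return data frames of different sizes.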

The take-home message seems to be: do not combine grouping and distinct. Just pass the relevant column names as arguments to distinct.
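A short sketch of that advice, with a base R cross-check so the result can be verified on whatever dplyr version is installed. The names by_cols and base_xy are just for illustration. One caveat to be aware of: in later dplyr releases, distinct keeps only the named columns unless .keep_all = TRUE is given, which makes no difference here because df has only x and y.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)

# Name the columns explicitly instead of relying on group_by().
by_cols <- df %>% distinct(x, y)

# Base R cross-check: unique() on the same two columns.
base_xy <- unique(df[c("x", "y")])

# TRUE when both approaches find the same number of distinct (x, y) pairs.
nrow(by_cols) == nrow(base_xy)
```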
