Why does dplyr::distinct behave this way for grouped data frames?

Question

My question involves the distinct function of dplyr.

First set up the data:

    library(dplyr)

    set.seed(0)
    df <- data.frame(
      x = sample(10, 100, replace = TRUE),
      y = sample(10, 100, replace = TRUE)
    )

Consider the following two uses of distinct.

    df %>% group_by(x) %>% distinct()
    df %>% group_by(x) %>% distinct(y)

The first produces a different result from the second. As far as I can tell, the first finds all distinct values of x and returns the first value of y for each, whereas the second finds, for each value of x, all distinct values of y.

Why should this be so, when

    df %>% distinct(x, y)
    df %>% distinct()

give the same result?

EDIT: It looks like this is already a known bug: https://github.com/hadley/dplyr/issues/1110

r dplyr

Alex Jul 07 '15 at 4:01

1 answer

Claus Wilke · Accepted Answer · 2015-07-07T04:08:37+0000

As far as I can tell, the answer is that distinct considers the grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.

Thus:

    df %>% group_by(x) %>% distinct()

Group by x, then find rows that are distinct in x (!). This seems to be a bug.

But:

    df %>% group_by(x) %>% distinct(y)

Group by x, then find rows that are distinct in y given x. This is equivalent to either of these:

    df %>% distinct(x, y)
    df %>% distinct()

Both find the distinct combinations of x and y.

The take-home message seems to be: don't combine grouping with distinct. Just pass the appropriate column names as arguments to distinct.
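As a minimal sketch of that advice, the snippet below rebuilds the question's data and deduplicates it without any group_by(), so the result does not depend on how a given dplyr version treats groups inside distinct. The variable names by_cols, all_cols and ungrouped are only illustrative, and the ungroup() step is an assumption about how one might handle a data frame that arrives already grouped.

```r
library(dplyr)

# Same data as in the question
set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)

# Name the columns explicitly instead of relying on group_by()
by_cols <- df %>% distinct(x, y)

# With only x and y in the data, using all columns asks for the same thing
all_cols <- df %>% distinct()

# If the data frame is already grouped, drop the grouping first
ungrouped <- df %>% group_by(x) %>% ungroup() %>% distinct(x, y)

# Compare the number of unique (x, y) combinations found each way
c(nrow(by_cols), nrow(all_cols), nrow(ungrouped))
```

All three counts should agree, since each call asks for the unique (x, y) pairs of an ungrouped table.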
