Removing duplicate rows with dplyr

I have a data.frame like this:

    set.seed(123)
    df = data.frame(x = sample(0:1, 10, replace = T),
                    y = sample(0:1, 10, replace = T),
                    z = 1:10)

    > df
       x y z
    1  0 1 1
    2  1 0 2
    3  0 1 3
    4  1 1 4
    5  1 0 5
    6  0 1 6
    7  1 0 7
    8  1 0 8
    9  1 0 9
    10 0 1 10

I would like to remove duplicate rows based on the first two columns. Expected result:

    df[!duplicated(df[, 1:2]), ]
      x y z
    1 0 1 1
    2 1 0 2
    4 1 1 4

I am specifically looking for a solution using the dplyr package.

+105
r dplyr
Apr 09 '14 at 10:22
6 answers

Note: dplyr now contains a distinct function for this purpose.
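For reference, a minimal sketch of the distinct() approach on the question's data (note that .keep_all = TRUE, which retains the non-grouping columns, requires dplyr >= 0.5):

    library(dplyr)

    set.seed(123)
    df <- data.frame(x = sample(0:1, 10, replace = TRUE),
                     y = sample(0:1, 10, replace = TRUE),
                     z = 1:10)

    # One row per unique (x, y) pair; .keep_all = TRUE keeps column z as well
    df %>% distinct(x, y, .keep_all = TRUE)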

Original answer below:

    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = T),
      y = sample(0:1, 10, replace = T),
      z = 1:10
    )

One approach is to group and then keep only the first row of each group:

    df %>% group_by(x, y) %>% filter(row_number(z) == 1)

    ## Source: local data frame [3 x 3]
    ## Groups: x, y
    ##
    ##   x y z
    ## 1 0 1 1
    ## 2 1 0 2
    ## 3 1 1 4

(In dplyr 0.2, you don’t need the dummy variable z and can just write row_number() == 1.)
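That shorter form would look like this (a sketch, using the df defined above):

    df %>%
      group_by(x, y) %>%
      filter(row_number() == 1)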

I also thought about adding a slice() function that would work like this:

 df %>% group_by(x, y) %>% slice(from = 1, to = 1) 

Or perhaps a unique() option that allows you to choose which variables to use:

 df %>% unique(x, y) 
+113
Apr 09 '14 at 10:48

Here is a solution using dplyr 0.3.

    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = T),
      y = sample(0:1, 10, replace = T),
      z = 1:10
    )

    > df %>% distinct(x, y)
      x y z
    1 0 1 1
    2 1 0 2
    3 1 1 4

Updated for dplyr 0.5

As of dplyr 0.5, the default behavior of distinct() is to return only the columns specified in its ... argument.

To reproduce the original result, you should now use:

 df %>% distinct(x, y, .keep_all = TRUE) 
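Side by side, a short sketch of the two calls on the example data above:

    # dplyr >= 0.5: returns only the x and y columns
    df %>% distinct(x, y)

    # returns all columns, keeping the first row per (x, y) pair
    df %>% distinct(x, y, .keep_all = TRUE)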
+172
Oct 10 '14 at 2:59

For completeness, the following also works:

    df %>% group_by(x) %>% filter(!duplicated(y))

However, I prefer the solution using distinct(), and I suspect it is faster.
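If you want to check the speed claim yourself, one way is a quick benchmark; a sketch using the microbenchmark package (the data size here is an arbitrary assumption):

    library(dplyr)
    library(microbenchmark)  # assumes the microbenchmark package is installed

    set.seed(123)
    big <- data.frame(x = sample(0:1, 1e5, replace = TRUE),
                      y = sample(0:1, 1e5, replace = TRUE))

    microbenchmark(
      distinct = big %>% distinct(x, y),
      filter   = big %>% group_by(x) %>% filter(!duplicated(y)),
      times = 20
    )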

+24
Dec 04 '14 at 11:19

When you select a subset of columns in R, the reduced data set often contains duplicate rows.

These two lines give the same result. Each returns a data set of unique rows with only the two selected columns:

    distinct(mtcars, cyl, hp)

    summarise(group_by(mtcars, cyl, hp))
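To convince yourself that they agree, a small sketch of a check (row order can differ between the two, so dplyr's setequal(), which compares rows regardless of order, is convenient; this assumes it accepts a data frame and a tibble interchangeably):

    library(dplyr)

    a <- distinct(mtcars, cyl, hp)
    b <- summarise(group_by(mtcars, cyl, hp))

    nrow(a)         # one row per unique (cyl, hp) combination
    setequal(a, b)  # TRUE: the same rows, ignoring row order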
+2
Jun 16 '17 at 11:13

In most cases, the best solution is to use distinct() from dplyr, as already suggested.

However, there is another approach here that uses the slice() function from dplyr.

    # Generate fake data for the example
    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = T),
      y = sample(0:1, 10, replace = T),
      z = 1:10
    )

    # In each group of rows formed by a combination of x and y,
    # retain only the first row
    df %>%
      group_by(x, y) %>%
      slice(1)

Difference from using the distinct() function

The advantage of this solution is that it makes explicit which rows of the original data frame are retained, and it combines nicely with the arrange() function.

Suppose you have customer sales data and you want to keep one record per customer, and you want that record to be the one from their most recent purchase. Then you could write:

    customer_purchase_data %>%
      arrange(desc(Purchase_Date)) %>%
      group_by(Customer_ID) %>%
      slice(1)
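Since distinct() also keeps the first occurrence within each group, a sketch of an equivalent one-liner (same assumed column names as above):

    customer_purchase_data %>%
      arrange(desc(Purchase_Date)) %>%
      distinct(Customer_ID, .keep_all = TRUE)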
0
Feb 12 '19 at 23:04

If you want to find the duplicated rows, you can use find_duplicates from hablar:

    library(dplyr)
    library(hablar)

    df <- tibble(a = c(1, 2, 2, 4),
                 b = c(5, 2, 2, 8))

    df %>% find_duplicates()
0
Jun 10 '19 at 21:29
