Removing duplicate rows with dplyr

I have a data.frame like this:

    set.seed(123)
    df = data.frame(x = sample(0:1, 10, replace = T),
                    y = sample(0:1, 10, replace = T),
                    z = 1:10)

    > df
       x y z
    1  0 1 1
    2  1 0 2
    3  0 1 3
    4  1 1 4
    5  1 0 5
    6  0 1 6
    7  1 0 7
    8  1 0 8
    9  1 0 9
    10 0 1 10

I would like to remove duplicate rows based on the first two columns. Expected result:

    df[!duplicated(df[, 1:2]), ]
      x y z
    1 0 1 1
    2 1 0 2
    4 1 1 4

I am specifically looking for a solution using the dplyr package.

+105
r dplyr
Apr 09 '14 at 10:22
6 answers

Note: dplyr now contains a distinct function for this purpose.
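For reference, a minimal sketch of the distinct() approach on the question's data (note that .keep_all = TRUE, which retains the non-grouping columns, requires dplyr >= 0.5):

    library(dplyr)

    set.seed(123)
    df <- data.frame(x = sample(0:1, 10, replace = TRUE),
                     y = sample(0:1, 10, replace = TRUE),
                     z = 1:10)

    # One row per unique (x, y) pair; .keep_all = TRUE keeps column z as well
    df %>% distinct(x, y, .keep_all = TRUE)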

Original answer below:

    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = T),
      y = sample(0:1, 10, replace = T),
      z = 1:10
    )

One approach is to group and then keep only the first row of each group:

    df %>% group_by(x, y) %>% filter(row_number(z) == 1)

    ## Source: local data frame [3 x 3]
    ## Groups: x, y
    ##
    ##   x y z
    ## 1 0 1 1
    ## 2 1 0 2
    ## 3 1 1 4

(In dplyr 0.2, you don’t need the dummy variable z and can just write row_number() == 1.)
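That shorter form would look like this (a sketch, using the df defined above):

    df %>%
      group_by(x, y) %>%
      filter(row_number() == 1)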

I also thought about adding a slice() function that would work like this:

 df %>% group_by(x, y) %>% slice(from = 1, to = 1) 

Or perhaps a unique() option that allows you to choose which variables to use:

 df %>% unique(x, y) 
+113
Apr 09 '14 at 10:48

Here is a solution using dplyr 0.3.

    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = T),
      y = sample(0:1, 10, replace = T),
      z = 1:10
    )

    > df %>% distinct(x, y)
      x y z
    1 0 1 1
    2 1 0 2
    3 1 1 4

Updated for dplyr 0.5

As of dplyr 0.5, the default behavior of distinct() is to return only the columns specified in its ... argument.

To reproduce the original result, you should now use:

 df %>% distinct(x, y, .keep_all = TRUE) 
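Side by side, a short sketch of the two calls on the example data above:

    # dplyr >= 0.5: returns only the x and y columns
    df %>% distinct(x, y)

    # returns all columns, keeping the first row per (x, y) pair
    df %>% distinct(x, y, .keep_all = TRUE)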
+172
Oct 10 '14 at 2:59

For completeness, the following also works:

    df %>% group_by(x) %>% filter(!duplicated(y))

However, I prefer the solution using distinct(), and I suspect it is faster.
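If you want to check the speed claim yourself, one way is a quick benchmark; a sketch using the microbenchmark package (the data size here is an arbitrary assumption):

    library(dplyr)
    library(microbenchmark)  # assumes the microbenchmark package is installed

    set.seed(123)
    big <- data.frame(x = sample(0:1, 1e5, replace = TRUE),
                      y = sample(0:1, 1e5, replace = TRUE))

    microbenchmark(
      distinct = big %>% distinct(x, y),
      filter   = big %>% group_by(x) %>% filter(!duplicated(y)),
      times = 20
    )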

+24
Dec 04 '14 at 11:19

When you select a subset of columns in R, the reduced data set often contains duplicate rows.

These two lines give the same result. Each returns a data set of unique rows with only the two selected columns:

    distinct(mtcars, cyl, hp)

    summarise(group_by(mtcars, cyl, hp))
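To convince yourself that they agree, a small sketch of a check (row order can differ between the two, so dplyr's setequal(), which compares rows regardless of order, is convenient; this assumes it accepts a data frame and a tibble interchangeably):

    library(dplyr)

    a <- distinct(mtcars, cyl, hp)
    b <- summarise(group_by(mtcars, cyl, hp))

    nrow(a)         # one row per unique (cyl, hp) combination
    setequal(a, b)  # TRUE: the same rows, ignoring row order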
+2
Jun 16 '17 at 11:13

In most cases, the best solution is to use distinct() from dplyr, as already suggested.

However, there is another approach here that uses the slice() function from dplyr.

    # Generate fake data for the example
    library(dplyr)
    set.seed(123)
    df <- data.frame(
      x = sample(0:1, 10, replace = T),
      y = sample(0:1, 10, replace = T),
      z = 1:10
    )

    # In each group of rows formed by a combination of x and y,
    # retain only the first row
    df %>%
      group_by(x, y) %>%
      slice(1)

Difference from using the distinct() function

The advantage of this solution is that it makes explicit which rows of the original data frame are retained, and it combines nicely with the arrange() function.

Suppose you have customer sales data and you want to keep one record per customer, and you want that record to be the one from their most recent purchase. Then you could write:

    customer_purchase_data %>%
      arrange(desc(Purchase_Date)) %>%
      group_by(Customer_ID) %>%
      slice(1)
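Since distinct() also keeps the first occurrence within each group, a sketch of an equivalent one-liner (same assumed column names as above):

    customer_purchase_data %>%
      arrange(desc(Purchase_Date)) %>%
      distinct(Customer_ID, .keep_all = TRUE)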
0
Feb 12 '19 at 23:04

If you want to find the duplicated rows, you can use find_duplicates from hablar:

    library(dplyr)
    library(hablar)

    df <- tibble(a = c(1, 2, 2, 4),
                 b = c(5, 2, 2, 8))

    df %>% find_duplicates()
0
Jun 10 '19 at 21:29
