Find all duplicate rows, including "elements with lower indices"

R's duplicated returns a vector indicating whether each element of a vector or data frame is a duplicate of an element with a lower index. So if rows 3, 4 and 5 of a 5-row data frame are identical, duplicated will give me the vector

 FALSE, FALSE, FALSE, TRUE, TRUE 

But in this case, I really want to get

 FALSE, FALSE, TRUE, TRUE, TRUE 

that is, I want to know whether a row is also duplicated by a row with a higher index.

Tags: r, r-faq, duplicates
Oct 21 '11 at 19:37
5 answers

duplicated has a fromLast argument. The Examples section of ?duplicated shows you how to use it: just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.
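Applied to data-frame rows, as in the question, that advice might look like this (a minimal sketch, assuming df is the questioner's 5-row data frame):

    # all duplicated rows, first occurrences included
    dups <- duplicated(df) | duplicated(df, fromLast = TRUE)
    dups        # with rows 3, 4, 5 identical: FALSE FALSE TRUE TRUE TRUE
    df[dups, ]  # the duplicated rows themselves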

A somewhat belated edit: you did not provide a reproducible example, so here is an illustration, courtesy of @jbaums:

    vec <- c("a", "b", "c", "c", "c")
    vec[duplicated(vec) | duplicated(vec, fromLast = TRUE)]
    ## [1] "c" "c" "c"
Oct 21 '11 at 19:56

You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem brings the process to life:

    > vec <- c("a", "b", "c", "c", "c")
    > vec[duplicated(vec)]
    [1] "c" "c"
    > unique(vec[duplicated(vec)])
    [1] "c"
    > vec %in% unique(vec[duplicated(vec)])
    [1] FALSE FALSE  TRUE  TRUE  TRUE
Oct 21 '11 at 19:49

I had the same question, and if I'm not mistaken, this is also an answer:

    # assuming 'vec' is a data frame with a column 'col'
    vec[vec$col %in% vec$col[duplicated(vec$col)], ]

I don't know which one is faster, though; the data set I'm working with right now isn't large enough to produce timing differences worth measuring.
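If speed ever matters, a quick timing comparison could be run with the microbenchmark package (a hypothetical sketch with made-up test data, not from the answer):

    # hypothetical benchmark sketch; assumes the 'microbenchmark' package is installed
    library(microbenchmark)
    dat <- data.frame(col = sample(letters, 1e5, replace = TRUE))
    microbenchmark(
      fromLast = dat[duplicated(dat$col) | duplicated(dat$col, fromLast = TRUE), ],
      match_in = dat[dat$col %in% unique(dat$col[duplicated(dat$col)]), ]
    )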

Jun 01 '16 at 14:26

Duplicate rows in a data frame can be obtained with dplyr by doing

    library(dplyr)
    df <- bind_rows(iris, head(iris, 20))  # build some test data
    df %>% group_by_all() %>% filter(n() > 1) %>% ungroup()

To exclude some columns, group_by_at(vars(-var1, -var2)) can be used instead to group data.
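For instance, with the iris-based df built above, duplicates could be detected while ignoring the Species column (a sketch using the same pipeline):

    # sketch: find duplicates judged on every column except 'Species'
    df %>%
      group_by_at(vars(-Species)) %>%
      filter(n() > 1) %>%
      ungroup()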

If you actually need the row indices rather than just the data, you can add them first, as in:

    df %>%
      add_rownames() %>%
      group_by_at(vars(-rowname)) %>%
      filter(n() > 1) %>%
      pull(rowname)
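Note that add_rownames was deprecated in later dplyr releases; a sketch of the same pipeline with tibble::rownames_to_column, whose default column name is also rowname:

    # same pipeline with the non-deprecated tibble helper
    df %>%
      tibble::rownames_to_column() %>%
      group_by_at(vars(-rowname)) %>%
      filter(n() > 1) %>%
      pull(rowname)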
Jun 17 '19 at 13:47

If you are interested in which rows are duplicated for specific columns, you can use a plyr approach:

    ddply(df, .(col1, col2), function(d) if (nrow(d) > 1) d else c())
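As a runnable illustration (my own made-up test data, with the placeholder columns col1 and col2 swapped for real iris names):

    # sketch: keep groups of rows that share the two chosen column values
    library(plyr)
    df <- rbind(iris, head(iris, 5))  # inject some duplicates
    ddply(df, .(Sepal.Length, Species), function(d) if (nrow(d) > 1) d else c())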

Adding a count variable with dplyr:

    df %>% add_count(col1, col2) %>% filter(n > 1)    # data frame
    df %>% add_count(col1, col2) %>% select(n) > 1    # logical vector

For duplicate rows (including all columns):

    df %>% group_by_all %>% add_tally %>% ungroup %>% filter(n > 1)
    df %>% group_by_all %>% add_tally %>% ungroup %>% select(n) > 1

The benefit of these approaches is that you can specify how many duplicates to use as a cutoff.
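For example, the tally makes the cutoff easy to change (a sketch, assuming the same df as above, that keeps only rows occurring at least three times):

    # sketch: raise the cutoff from "more than one" to "three or more"
    df %>% group_by_all() %>% add_tally() %>% ungroup() %>% filter(n >= 3)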

Jun 06 '19 at 21:14