Delete duplicate rows

Question

I have read a CSV file into an R data.frame. Some of the rows have the same element in one of the columns. I would like to remove rows that are duplicates in that column. For example:

    platform_external_dbus 202 16 google             1
    platform_external_dbus 202 16 space-ghost.verbum 1
    platform_external_dbus 202 16 localhost          1
    platform_external_dbus 202 16 users.sourceforge  8
    platform_external_dbus 202 16 hughsie            1

I would like only one of these rows, since the rest have the same data in the first column.

+134

r r-faq duplicates

user1897691 Dec 20 '12 at 7:17

7 answers

For those who came here looking for a general way to remove duplicate rows, use !duplicated():

    a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
    b <- c(1, 1, 2, 4, 1, 1, 2, 2)
    df <- data.frame(a, b)

    duplicated(df)
    [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

    > df[duplicated(df), ]
      a b
    2 A 1
    6 B 1
    8 C 2

    > df[!duplicated(df), ]
      a b
    1 A 1
    3 A 2
    4 B 4
    5 B 1
    7 C 2

From: Removing duplicate rows from data frame R
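
The question asks about duplicates in one column rather than in whole rows. A minimal base R sketch, reusing the example data above and assuming column a is the key column: pass only that column to duplicated(). By default the first occurrence of each value is kept; fromLast = TRUE keeps the last one instead.

```r
# Same example data as above
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# Keep the first row for each value of column a (all columns are retained)
first_per_a <- df[!duplicated(df$a), ]

# Keep the last row for each value of column a instead
last_per_a <- df[!duplicated(df$a, fromLast = TRUE), ]

print(first_per_a)
print(last_per_a)
```

Which row survives depends on row order, so sort the data frame first if a particular row per key should be kept.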

+169

Mehdi Nellen Feb 24 '14 at 12:07

The distinct() function in the dplyr package removes duplicates flexibly, either on specified variables (as in this question) or across all variables.

Data:

 dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))

Delete rows in which the specified columns are duplicated:

    library(dplyr)
    dat %>% distinct(a, .keep_all = TRUE)
      a b
    1 1 A
    2 2 B

Delete rows that are complete duplicates of other rows:

    dat %>% distinct
      a b
    1 1 A
    2 2 B
    3 1 C
    4 2 D

+69

Sam Firke May 29 '15 at 1:10

The data.table package also has its own unique and duplicated methods with some additional features.

Both the unique.data.table and duplicated.data.table methods have an additional by argument, which accepts a character vector of column names or an integer vector of column positions.

    library(data.table)
    DT <- data.table(id = c(1,1,1,2,2,2), val = c(10,20,30,10,20,30))

    unique(DT, by = "id")
    #    id val
    # 1:  1  10
    # 2:  2  10

    duplicated(DT, by = "id")
    # [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

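Another important feature of these methods is a large performance gain on large datasets. In the benchmark below, the data.table methods have median times roughly 60 to 70 times lower than their data.frame counterparts.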

    library(microbenchmark)
    library(data.table)
    set.seed(123)
    DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
    DT <- copy(DF)
    setDT(DT)

    microbenchmark(unique(DF), unique(DT))
    # Unit: microseconds
    #        expr       min         lq      mean    median        uq       max neval cld
    #  unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18   100   b
    #  unique(DT)   746.855   776.6145  2201.657   864.932   919.489  55986.88   100  a

    microbenchmark(duplicated(DF), duplicated(DT))
    # Unit: microseconds
    #            expr       min         lq       mean     median        uq        max neval cld
    #  duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170   100   b
    #  duplicated(DT)   551.982   558.2215   851.0246   639.9795   663.658   5805.243   100  a

+27

David Arenburg Mar 17 '16 at 11:01

A general answer could be for example:

    df <- data.frame(rbind(c(2,9,6), c(4,6,7), c(4,6,7), c(4,6,7), c(2,9,6)))
    # logical indexing also works when df contains no duplicates
    new_df <- df[!duplicated(df), ]

Output:

      X1 X2 X3
    1  2  9  6
    2  4  6  7
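
A note on the indexing style: dropping rows with -which(duplicated(df)) gives the same result as !duplicated(df) only when at least one duplicate exists. The sketch below, using a hypothetical data frame no_dups that has no duplicate rows, illustrates the difference.

```r
# Hypothetical data frame with no duplicate rows
no_dups <- data.frame(x = 1:3, y = c("a", "b", "c"))

# which() returns integer(0) here, and negating it still selects no rows
via_which <- no_dups[-which(duplicated(no_dups)), ]

# Logical negation keeps every row, as intended
via_logical <- no_dups[!duplicated(no_dups), ]

nrow(via_which)    # 0
nrow(via_logical)  # 3
```

For that reason the logical form is usually the safer choice inside functions or scripts that run on arbitrary input.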

+6

Amit Gupta Dec 23 '17 at 10:54

With sqldf :

    # Example by Mehdi Nellen
    a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
    b <- c(1, 1, 2, 4, 1, 1, 2, 2)
    df <- data.frame(a, b)

Solution:

    library(sqldf)
    sqldf('SELECT DISTINCT * FROM df')

Output:

      a b
    1 A 1
    2 A 2
    3 B 4
    4 B 1
    5 C 2

+5

mpalanco Jun 24 '15 at 10:46

Or you can nest the data in columns 4 and 5 into a single row using tidyr:

    library(tidyr)
    df %>% nest(V4:V5)
    # A tibble: 1 × 4
    #                       V1    V2    V3             data
    #                   <fctr> <int> <int>           <list>
    # 1 platform_external_dbus   202    16 <tibble [5 × 2]>

The duplicates in columns 2 and 3 are now removed for statistical analysis, but the data from columns 4 and 5 is kept in a nested tibble, so you can return to the original data frame at any point using unnest().

+3

Joe Oct 22 '16 at 11:16

Anthony Damico · Accepted Answer · 2012-12-20 07:21

Just subset your data frame to the columns you need, then use the unique function :D

    # in the above example, you only need the first three columns
    deduped.data <- unique( yourdata[ , 1:3 ] )
    # the fourth column no longer 'distinguishes' them,
    # so they're duplicates and thrown out.
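
To make the effect concrete, here is a small sketch that rebuilds the question's sample rows. The column names V1 to V5 are assumed, since the question does not show a header. Note that unique() on the subset returns only the three selected columns; the variant at the end is one possible way to keep all five columns of the first matching row.

```r
# The question's sample rows; column names V1..V5 are assumed
yourdata <- data.frame(
  V1 = rep("platform_external_dbus", 5),
  V2 = rep(202, 5),
  V3 = rep(16, 5),
  V4 = c("google", "space-ghost.verbum", "localhost",
         "users.sourceforge", "hughsie"),
  V5 = c(1, 1, 1, 8, 1)
)

# unique() on the first three columns: one row, columns 4 and 5 are dropped
deduped.data <- unique(yourdata[, 1:3])

# Variant that keeps all five columns of the first matching row
deduped.full <- yourdata[!duplicated(yourdata[, 1:3]), ]

print(deduped.data)
print(deduped.full)
```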
