Delete duplicate row values by columns in large data.frame

Question

Delete duplicate row values by columns in large data.frame

I have data.frame

 ID code1 code2 code3 A 143 143 144 A 35 453 35 A 35 15 B 46 46 45 B 12 43 765 C 255 455 344 C 343 343 343 C 343 23 23

each code appears at a time

identifier can be repeated. Real data.frame is very big

 ID code1 code2 code3 A 143 144 A 35 453 A 35 15 B 46 45 B 12 43 765 C 255 455 344 C 343 C 343 23

thanks

+4

r duplicate-removal

Hannah Apr 28 '13 at 12:33

source share

2 answers

How about this?

 df[, 2:4][t(apply(df[,2:4], 1, duplicated))] <- NA

Edit: Faster basic solution:

 for (i in 2:(ncol(df)-1)) { for (j in (i+1):ncol(df)) { chk <- df[[i]] == df[[j]] df[[j]][chk] <- NA } }

Here is an example of the two above methods along with AnandaMahto reshape2 and data.table for big data. Using for-loop with the correct indices i and j seems the fastest.

Benchmarking Results:

 require(microbenchmark) microbenchmark(ar.f <- arun.f(df), ar.s <- arun.s(df), an.f <- ananda.ave(df), an.s <- ananda.dt(copy(DT)), times=10) # Unit: milliseconds # expr min lq median uq max neval # arun.f(df) 4816.3937 5197.0626 6402.454 6955.9380 7534.6912 10 # arun.s(df) 114.8372 118.7971 149.284 202.6081 297.4787 10 # ananda.ave(df) 2877.7936 3288.5935 3650.660 3985.5390 4111.9064 10 # ananda.dt(copy(DT)) 3383.1229 3861.6379 4432.751 4776.6108 5368.6504 10

Data Creation:

 set.seed(1234) df <- cbind(data.frame(ID = rep(letters[1:20], each=1e4)), stringsAsFactors=FALSE), matrix(sample(1:10, 6 * 1e5, replace=TRUE), ncol=3)) names(df)[2:4] <- paste0("code", 1:3)

My first version:

 arun.f <- function(df) { df[, 2:4][t(apply(df[,2:4], 1, duplicated))] <- NA df }

My second version:

 arun.s <- function(df) { for (i in 2:(ncol(df)-1)) { for (j in (i+1):ncol(df)) { chk <- df[[i]] == df[[j]] df[[j]][chk] <- NA } } df }

Ananda `ave` + `reshape2` :

 library(reshape2) ananda.ave <- function(df) { df$ID2 <- with(df, ave(ID, ID, FUN = seq_along)) m.df <- melt(df, id.vars=c("ID", "ID2")) m.df[duplicated(m.df[setdiff(names(m.df), "variable")]), "value"] <- NA dcast(m.df, ID + ID2 ~ variable) }

Ananda `data.table` :

(modified bit for greater optimization)

 library(data.table) DT <- data.table(df) ananda.dt <- function(dt) { temp <- dt[, list(ID2 = 1:.N, Value = unlist(.SD, use.names=FALSE)), by ="ID"] temp[duplicated(temp), Value := NA] out <- setnames(temp[, as.list(Value), by=list(ID, ID2)], 3:5, paste0("code", 1:3)) }

+4

Arun Apr 28 '13 at 12:44

source share

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-04-28T16:19:13+0000

This solution is likely to be somewhat ineffective, but this is mainly due to the fact that data is transferred between large and long and wide. However, it may be easier for you to work with your data in a "long" form.

First create a second identifier, as you have identifiers spanning multiple lines.

 mydf$ID2 <- with(mydf, ave(ID, ID, FUN = seq_along))

Secondly, use melt from the "reshape2" package to keep your data in long form.

 library(reshape2) m.df <- melt(mydf, id.vars=c("ID", "ID2"))

With data in its long form, it is much easier to identify duplicates and replace them with NA .

 m.df[duplicated(m.df[setdiff(names(m.df), "variable")]), "value"] <- NA

If you are satisfied with your data in a long form. To stop there. If you want to return it to its wide form, use dcast (again from "reshape2").

 dcast(m.df, ID + ID2 ~ variable) # ID ID2 code1 code2 code3 # 1 A 1 143 NA 144 # 2 A 2 35 453 NA # 3 A 3 NA 35 15 # 4 B 1 46 NA 45 # 5 B 2 12 43 765 # 6 C 1 255 455 344 # 7 C 2 343 NA NA # 8 C 3 343 23 NA

For reference, this is also doable in the R database, but the syntax is more awkward (although it may work better than the equivalents of "reshape2").

 mydf$ID2 <- with(mydf, ave(ID, ID, FUN = seq_along)) m.df <- cbind(mydf[c("ID", "ID2")], stack(mydf[setdiff(names(mydf), c("ID", "ID2"))])) m.df[duplicated(m.df[setdiff(names(m.df), "ind")]), "values"] <- NA cbind(mydf[c("ID", "ID2")], unstack(m.df, values ~ ind))

Update: possible solution to `data.table`

You might want to examine data.table , as you mentioned that your data is big. Here's one possible solution (although @Arun may have a more direct sharing solution).

 library(data.table) DT <- data.table(mydf) ## Creates your long data.table temp <- DT[, list(ID2 = 1:.N, Value = unlist(.SD)), by ="ID"] ## Changes duplicates to NA and adds in the "Code" column temp[duplicated(temp), Value := NA][, Variable := rep(names(DT)[-1], each = nrow(DT))] ## "Reshapes" the data from long to wide temp[, as.list(setattr(Value, 'names', Variable)), by=list(ID, ID2)] # ID ID2 code1 code1 code1 # 1: A 1 143 NA 144 # 2: A 2 35 453 NA # 3: A 3 NA 35 15 # 4: B 1 46 NA 45 # 5: B 2 12 43 765 # 6: C 1 255 455 344 # 7: C 2 343 NA NA # 8: C 3 343 23 NA

Delete duplicate row values ​​by columns in large data.frame

Update: possible solution to data.table

Benchmarking Results:

Data Creation:

My first version:

My second version:

Ananda ave + reshape2 :

Ananda data.table :

More articles:

Delete duplicate row values by columns in large data.frame

Update: possible solution to `data.table`

Ananda `ave` + `reshape2` :

Ananda `data.table` :