How to identify only non-duplicated rows

Question


I have the following situation: several data.tables that are combined into one with rbind.

    library(data.table)
    x <- data.table(id=c(1,2,3,4),dsp=c(5,6,7,8),status=c(FALSE,TRUE,FALSE,TRUE))
    y <- data.table(id=c(1,2,3,4),dsp=c(6,6,7,8),status=c(FALSE,FALSE,FALSE,TRUE))
    z <- data.table(id=c(1,2,3,4),dsp=c(5,6,9,8),status=c(FALSE,TRUE,FALSE,FALSE))
    w <- data.table(id=c(1,2,3,4),dsp=c(5,6,7,NA),status=c(FALSE,TRUE,FALSE,TRUE))
    setkey(x,id)
    setkey(y,id)
    setkey(z,id)
    setkey(w,id)
    Bigdt<-rbind(x,y,z,w)

I want to get ONLY the rows that are not duplicated, that is, rows that occur exactly once in the combined table:

    id dsp status
     1   6  FALSE
     2   6  FALSE
     3   9  FALSE
     4   8  FALSE
     4  NA   TRUE

So I tried

 Resultdt<-Bigdt[!duplicated(Bigdt)]

but the result:

    id dsp status
     1   5  FALSE
     2   6   TRUE
     3   7  FALSE
     4   8   TRUE

does not meet my expectations. I also tried other approaches (rbind is optional), such as merging and joining, without success. The data.table package seems likely to contain the solution. Any ideas?


r data.table

Antonello Salis May 27 '16 at 14:57


2 answers

Frank · Answer 1 · 2016-05-27T15:10:13+0000

You can do

    Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL][]

       id dsp status
    1:  1   6  FALSE
    2:  2   6  FALSE
    3:  3   9  FALSE
    4:  4   8  FALSE
    5:  4  NA   TRUE

To see how this works, run the DT[][][][] chain one piece at a time:

Bigdt[, .N, by=names(Bigdt)]
Bigdt[, .N, by=names(Bigdt)][N == 1L]
Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL]
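The three steps of the chain can also be written out with intermediate variables. The following is only a sketch of that idea: the names `counts` and `once` are made up for illustration, and the data is rebuilt from the question (without the setkey calls, which grouping by all columns does not need).

```r
library(data.table)

# Rebuild the combined table from the question.
x <- data.table(id = c(1, 2, 3, 4), dsp = c(5, 6, 7, 8),  status = c(FALSE, TRUE, FALSE, TRUE))
y <- data.table(id = c(1, 2, 3, 4), dsp = c(6, 6, 7, 8),  status = c(FALSE, FALSE, FALSE, TRUE))
z <- data.table(id = c(1, 2, 3, 4), dsp = c(5, 6, 9, 8),  status = c(FALSE, TRUE, FALSE, FALSE))
w <- data.table(id = c(1, 2, 3, 4), dsp = c(5, 6, 7, NA), status = c(FALSE, TRUE, FALSE, TRUE))
Bigdt <- rbind(x, y, z, w)

# Step 1: count how often each distinct (id, dsp, status) combination occurs.
counts <- Bigdt[, .N, by = names(Bigdt)]

# Step 2: keep only the combinations that occur exactly once.
once <- counts[N == 1L]

# Step 3: remove the helper column N by reference.
once[, N := NULL]

print(once)
```

Because grouping uses every column, a row counts as a duplicate only if all of id, dsp and status match, and the row with dsp = NA forms its own group.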

akrun · Answer 2 · 2016-05-27T20:43:54+0000

You can also try

    Bigdt[!(duplicated(Bigdt)|duplicated(Bigdt, fromLast=TRUE))]
    #   id dsp status
    #1:  1   6  FALSE
    #2:  2   6  FALSE
    #3:  3   9  FALSE
    #4:  4   8  FALSE
    #5:  4  NA   TRUE

Or, using .SD:

 Bigdt[Bigdt[,!(duplicated(.SD)|duplicated(.SD, fromLast=TRUE))]]
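Both duplicated() variants work because duplicated() on its own flags only the later copies of a row and leaves the first occurrence unflagged; scanning again with fromLast=TRUE flags the earlier copies, so the union covers every member of a duplicated group. A small sketch on made-up data follows (`v`, `toy` and the other variable names are hypothetical, not from the question). Passing `by` explicitly is my assumption for robustness: as far as I know, older data.table versions compared only the key columns by default, while newer ones compare all columns.

```r
library(data.table)

# Plain-vector illustration: "a" occurs twice, "b" and "c" once.
v <- c("a", "b", "a", "c")
dup_forward  <- duplicated(v)                   # flags the second "a" only
dup_backward <- duplicated(v, fromLast = TRUE)  # flags the first "a" only
v_once <- v[!(dup_forward | dup_backward)]      # values occurring exactly once

# Same idea on a small hypothetical data.table.
toy  <- data.table(id = c(1, 1, 2, 3), val = c("a", "a", "b", "c"))
cols <- names(toy)

# Compare on all columns explicitly, independent of any key or version default.
in_dup_group <- duplicated(toy, by = cols) |
  duplicated(toy, by = cols, fromLast = TRUE)

keep     <- !in_dup_group
toy_once <- toy[keep]
print(toy_once)
```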

Another option is to group by all column names, use .I to get the row index of each group that contains a single row, and subset the dataset with those indices:

 Bigdt[Bigdt[, .I[.N==1], by = names(Bigdt)]$V1]
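To make the .I approach easier to follow, here is a sketch on a small made-up table (`toy` and `idx` are hypothetical names). Within each group, .I holds the row numbers of that group in the original table and .N its size, so `.I[.N == 1L]` returns the row number for single-row groups and nothing for larger groups.

```r
library(data.table)

# Hypothetical toy data: rows 1 and 2 are identical, rows 3 and 4 are unique.
toy <- data.table(id = c(1, 1, 2, 3), val = c("a", "a", "b", "c"))

# One group per distinct row; the unnamed result column is called V1.
idx <- toy[, .I[.N == 1L], by = names(toy)]
print(idx$V1)   # row numbers of rows that occur exactly once

# Subset the original table by those row numbers.
toy_once <- toy[idx$V1]
print(toy_once)
```

Compared with the count-and-filter chain, this variant subsets the original table directly, so no helper column has to be removed afterwards.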
