Find duplicate groups in a data frame with all but one column

Question

I have a large data frame. For some purposes, I need to do the following:

Select one column in this data frame.
Iterate over all rows of the data frame, ignoring the selected column.
For each row, find all rows that are equal to it in every column except the selected one.
Group them so that the group name is the row index and the group values are the indices of the duplicated rows.

I wrote a function for this task, but it is slow due to a nested loop. I would like to get some ideas on how this code can be improved.

Let's say we have a data frame like this:

   V1 V2 V3 V4
 1  1  2  1  2
 2  1  2  2  1
 3  1  1  1  2
 4  1  1  2  1
 5  2  2  1  2

And we want to get this list as a result:

 diff.dataframe("V2", conf.new, conf.new)

Output:

 $`1`
 [1] 1

 $`2`
 [1] 2

 $`3`
 [1] 1 3

 $`4`
 [1] 2 4

 $`5`
 [1] 5

The following code does the job, but it is too slow. Is it possible to improve it somehow?

 diff.dataframe <- function(param, df1, df2){
   excl.names <- c(param)
   df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
   df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
   list.out <- list()
   for (i in 1:nrow(df1.excl)){
     for (j in 1:nrow(df2.excl)){
       if (paste(df1.excl[i,],collapse='') == paste(df2.excl[j,], collapse='')){
         if (!as.character(i) %in% unlist(list.out)){
           list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
         }
       }
     }
   }
   return(list.out)
 }

r dataframe

annndrey Dec 13 '12 at 9:59

1 answer

Backlin · Accepted Answer · 2012-12-13T12:44:47+0000

First, we generate some data:

 df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))

 # Produces df like this
   V1 V2 V3 V4
 1  2  1  1  1
 2  2  1  2  2
 3  1  1  2  2
 4  1  2  1  1
 5  1  2  1  1

Then we iterate over the rows with lapply. Each row i is compared with every row of df (including itself) using apply. Rows with at most one difference yield TRUE and the rest yield FALSE, which gives a logical vector that we convert to a vector of row indices with which.
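To see what the comparison produces for a single row, the step can be run in isolation. This is only a minimal sketch: the data frame is the sample shown above, entered by hand so the result is reproducible (the original is random), and the names n.diff, is.close and idx are illustrative.

```r
# The sample data frame from above, typed in by hand (assumption: the
# random draw happened to give exactly these values).
df <- data.frame(V1 = c(2, 2, 1, 1, 1),
                 V2 = c(1, 1, 1, 2, 2),
                 V3 = c(1, 2, 2, 1, 1),
                 V4 = c(1, 2, 2, 1, 1))

i <- 2

# For every row of df, count the columns in which it differs from row i.
n.diff <- apply(df, 1, function(x) sum(x != df[i, ]))

# Logical vector: TRUE where a row differs from row i in at most one column.
is.close <- n.diff <= 1

# Positions of the TRUE entries (names dropped for readability).
idx <- unname(which(is.close))
print(idx)
```

For row 2, only rows 2 and 3 differ in at most one column, so idx holds those two indices.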

 lapply(1:nrow(df), function(i)
   which(apply(df, 1, function(x) sum(x != df[i,]) <= 1)))

 # Produces output like this
 [[1]]
 [1] 1

 [[2]]
 [1] 2 3

 [[3]]
 [1] 2 3

 [[4]]
 [1] 4 5

 [[5]]
 [1] 4 5
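Note that this compares rows with a tolerance of one difference in any column. If what is needed is the exact grouping from the question (rows identical in every column except one named column), the row-by-row comparison can probably be avoided by building one key per row and splitting the row indices by that key. The following is a sketch under assumptions, not part of the original answer: group_duplicates is a hypothetical helper name, and conf.new is reconstructed from the example in the question.

```r
# Hypothetical helper: group the row indices of `df` by the values of all
# columns except `param`.
group_duplicates <- function(param, df) {
  keep <- setdiff(names(df), param)
  # One string key per row. The separator guards against accidental
  # collisions such as ("1", "12") versus ("11", "2").
  key <- do.call(paste, c(unname(lapply(df[keep], as.character)), sep = "\r"))
  # For each row, the index of the first row that has the same key.
  first <- match(key, key)
  # Group names are the first row index of each group.
  split(seq_len(nrow(df)), first)
}

# Example data reconstructed from the question.
conf.new <- data.frame(V1 = c(1, 1, 1, 1, 2),
                       V2 = c(2, 2, 1, 1, 2),
                       V3 = c(1, 2, 1, 2, 1),
                       V4 = c(2, 1, 2, 1, 2))

res <- group_duplicates("V2", conf.new)
print(res)
```

With this data, rows 1 and 3 form one group, rows 2 and 4 another, and row 5 is on its own. Each group is listed once under its first row index, which is a different layout from the list shown in the question. The key is built column-wise, so the work grows roughly linearly with the number of rows instead of quadratically.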
