Find groups of duplicate rows in a data frame, ignoring one column

I have a large data frame. I need to do the following:

  • Select one column in the data frame.
  • Iterate over all rows of the data frame, ignoring the selected column.
  • For each row, find all rows that are equal to it in every column except the selected one.
  • Group the results so that each group name is a row index and the group values are the indices of the rows that duplicate it.

I wrote a function for this task, but it is slow due to a nested loop. I would like to get some ideas on how this code can be improved.

Let's say we have a data frame like this:

      V1 V2 V3 V4
    1  1  2  1  2
    2  1  2  2  1
    3  1  1  1  2
    4  1  1  2  1
    5  2  2  1  2
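To make the example reproducible, the data frame above can be rebuilt as follows. This is only a sketch: the name `conf.new` is taken from the call used later in the question, and storing the values as integers is an assumption.

```r
# Rebuild the example data frame shown above, column by column.
# Assumptions: integer columns; the object name conf.new matches the later call.
conf.new <- data.frame(
  V1 = c(1L, 1L, 1L, 1L, 2L),
  V2 = c(2L, 2L, 1L, 1L, 2L),
  V3 = c(1L, 2L, 1L, 2L, 1L),
  V4 = c(2L, 1L, 2L, 1L, 2L)
)
print(conf.new)
```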

And we want to get this list as a result:

    diff.dataframe("V2", conf.new, conf.new)

Output:

    $`1`
    [1] 1

    $`2`
    [1] 2

    $`3`
    [1] 1 3

    $`4`
    [1] 2 4

    $`5`
    [1] 5

The following code produces the desired result, but is too slow. Is it possible to improve it somehow?

    diff.dataframe <- function(param, df1, df2){
      excl.names <- c(param)
      df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
      df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
      list.out <- list()
      for (i in 1:nrow(df1.excl)){
        for (j in 1:nrow(df2.excl)){
          if (paste(df1.excl[i,],collapse='') == paste(df2.excl[j,], collapse='')){
            if (!as.character(i) %in% unlist(list.out)){
              list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
            }
          }
        }
      }
      return(list.out)
    }
1 answer

First we generate some data:

    df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))

    # Produces df like this
    #   V1 V2 V3 V4
    # 1  2  1  1  1
    # 2  2  1  2  2
    # 3  1  1  2  2
    # 4  1  2  1  1
    # 5  1  2  1  1

Then we iterate over the rows using lapply. Each row i is compared with every row of df (including itself) using apply. Rows with at most one difference return TRUE, the rest return FALSE, creating a logical vector, which we convert to a numeric vector of row indices with which.

    lapply(1:nrow(df), function(i) which(apply(df, 1, function(x) sum(x != df[i,]) <= 1)))

    # Produces output like this
    # [[1]]
    # [1] 1
    #
    # [[2]]
    # [1] 2 3
    #
    # [[3]]
    # [1] 2 3
    #
    # [[4]]
    # [1] 4 5
    #
    # [[5]]
    # [1] 4 5
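If the goal is exact equality on every column except one named column, as in the question, the pairwise comparison can probably be avoided altogether by building one key per row and grouping row indices by that key. The following is a sketch under that assumption, not the approach used above. `group_by_key` is a hypothetical helper name, and the `"\r"` separator assumes that character never occurs in the data.

```r
# Hypothetical helper: for each row, list the indices of all rows that are
# identical to it on every column except `param`.
group_by_key <- function(param, df) {
  keep <- setdiff(names(df), param)
  # One string key per row. The separator keeps ("1", "12") distinct from
  # ("11", "2"). Assumption: the separator does not occur in the data.
  cols <- unname(lapply(df[keep], as.character))
  key <- do.call(paste, c(cols, sep = "\r"))
  # Row indices grouped by key (a single pass, no nested loop).
  groups <- split(seq_len(nrow(df)), key)
  # Look up each row's group and name the result by row index.
  out <- unname(groups[key])
  names(out) <- seq_len(nrow(df))
  out
}

# Example data from the question.
conf.new <- data.frame(
  V1 = c(1, 1, 1, 1, 2),
  V2 = c(2, 2, 1, 1, 2),
  V3 = c(1, 2, 1, 2, 1),
  V4 = c(2, 1, 2, 1, 2)
)
res <- group_by_key("V2", conf.new)
print(res)
```

Note that in this sketch every row lists its whole group, so rows 1 and 3 both map to `1 3`. That differs from the output shown in the question, where the first row of each group lists only itself.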