I have a large data frame. For some purposes, I need to do the following:
- Select one column in this data frame.
- Iterate over all rows of the data frame, ignoring the selected column.
- For each row, find all rows that are equal to it in every column except the selected one.
- Group them so that the group name is the row index and the group values are the indices of the duplicated rows.
I wrote a function for this task, but it is slow due to a nested loop. I would like to get some ideas on how this code can be improved.
Let's say we have a data frame like this:
      V1 V2 V3 V4
    1  1  2  1  2
    2  1  2  2  1
    3  1  1  1  2
    4  1  1  2  1
    5  2  2  1  2
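In case it helps to reproduce this, the data frame could be created as follows. The integer column type is an assumption; the name `conf.new` matches the call used in this question.

```r
# Example data from the question; integer columns are an assumption.
conf.new <- data.frame(
  V1 = c(1L, 1L, 1L, 1L, 2L),
  V2 = c(2L, 2L, 1L, 1L, 2L),
  V3 = c(1L, 2L, 1L, 2L, 1L),
  V4 = c(2L, 1L, 2L, 1L, 2L)
)
```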
And we want to get this list as a result:
    diff.dataframe("V2", conf.new, conf.new)
Output:

    $`1`
    [1] 1

    $`2`
    [1] 2

    $`3`
    [1] 1 3

    $`4`
    [1] 2 4

    $`5`
    [1] 5
The following code achieves this, but it is too slow. Is it possible to improve it somehow?
    diff.dataframe <- function(param, df1, df2){
      excl.names <- c(param)
      df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
      df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
      list.out <- list()
      for (i in 1:nrow(df1.excl)){
        for (j in 1:nrow(df2.excl)){
          if (paste(df1.excl[i,], collapse='') == paste(df2.excl[j,], collapse='')){
            if (!as.character(i) %in% unlist(list.out)){
              list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
            }
          }
        }
      }
      return(list.out)
    }
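One loop-free direction, as a rough sketch rather than a drop-in replacement: build a single key per row from the remaining columns, group the row indices by key, and for each row keep the matching indices up to and including itself. That appears to be what the nested loop returns when both arguments are the same data frame. The name `diff.dataframe.fast`, the single-data-frame signature, and the `"\r"` separator (assumed never to occur in the data) are all assumptions made for illustration.

```r
# Hypothetical helper, not part of the original code.
# Assumes df1 and df2 are the same data frame, as in the example call.
diff.dataframe.fast <- function(param, df) {
  n <- nrow(df)

  # Keep every column except `param`; drop = FALSE keeps a data frame
  # even if only one column remains.
  keep <- df[, !names(df) %in% param, drop = FALSE]

  # One string key per row. The "\r" separator is an assumption: it must
  # not appear in the data, otherwise distinct rows could share a key.
  key <- do.call(paste, c(unname(lapply(keep, as.character)), sep = "\r"))

  # Integer group id per row, numbered in order of first appearance.
  id <- match(key, unique(key))

  # Row indices belonging to each group, in increasing order.
  groups <- split(seq_len(n), id)

  # For row i, keep the matching rows j with j <= i.
  out <- lapply(seq_len(n), function(i) {
    g <- groups[[id[i]]]
    g[g <= i]
  })
  names(out) <- as.character(seq_len(n))
  out
}
```

Called as `diff.dataframe.fast("V2", conf.new)`, this should give the same list as the output shown above for the example data. The key building and grouping are vectorized; only the final per-row filtering still uses `lapply`.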