Removing duplicate pairs of columns, sorting rows based on 2 columns

In the following data frame, I want to keep rows only once if they contain duplicate pairs of Var1 and Var2 (1 4 and 4 1 are considered the same pair). I thought about sorting Var1 and Var2 within each row, and then removing duplicate rows based on both Var1 and Var2. However, I do not get the desired result.

Here's what my data looks like:

Var1 <- c(1,2,3,4,5,5)
Var2 <- c(4,3,2,1,5,5)
f <- c("blue","green","yellow","red","orange2","grey")
g <- c("blue","green","yellow","red","orange1","grey")
testdata <- data.frame(Var1,Var2,f,g)

I can sort within the rows, but the values of columns f and g should remain intact. How can I do this?

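library(data.table)
# apply() coerces all four columns to character and sorts them together, so f and g get reordered too
testdata <- t(apply(testdata, 1, function(x) x[order(x)]))
testdata <- as.data.table(testdata)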

Then I want to remove duplicate rows based on Var1 and Var2.

I want to get this as a result:

Var1 Var2 f       g
1    4    blue    blue
2    3    green   green
5    5    orange2 orange1

Thank you for your help!

3 answers

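Instead of sorting the whole data set, sort only "Var1" and "Var2" within each row, then use duplicated to remove the duplicate rows: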

testdata[1:2] <- t( apply(testdata[1:2], 1, sort) )
testdata[!duplicated(testdata[1:2]),]
#   Var1 Var2       f       g
#1    1    4    blue    blue
#2    2    3   green   green
#5    5    5 orange2 orange1
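
For numeric columns, the same idea can probably be expressed without a row-wise apply(), using base R's pmin() and pmax(). This is an alternative sketch, not the method given in the answer above; lo, hi, keep and result are names chosen here for illustration.

```r
testdata <- data.frame(
  Var1 = c(1, 2, 3, 4, 5, 5),
  Var2 = c(4, 3, 2, 1, 5, 5),
  f = c("blue", "green", "yellow", "red", "orange2", "grey"),
  g = c("blue", "green", "yellow", "red", "orange1", "grey")
)

# Element-wise minimum and maximum put each pair in a canonical order
lo <- pmin(testdata$Var1, testdata$Var2)
hi <- pmax(testdata$Var1, testdata$Var2)

# Keep the first row of every unordered pair; no column is rewritten
keep <- !duplicated(data.frame(lo, hi))
result <- testdata[keep, ]
```

Unlike the apply() version, this leaves Var1 and Var2 in their original orientation in the rows that are kept.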

Using dplyr:

library(dplyr)
testdata %>%
   rowwise() %>%
   # the separator keeps keys distinct, e.g. (1, 234) vs (12, 34)
   mutate(key = paste(sort(c(Var1, Var2)), collapse="_")) %>%
   distinct(key, .keep_all=TRUE) %>%
   select(-key)

# Source: local data frame [3 x 4]
# Groups: <by row>
# 
# # A tibble: 3 × 4
#    Var1  Var2       f       g
#   <dbl> <dbl>  <fctr>  <fctr>
# 1     1     4    blue    blue
# 2     2     3   green   green
# 3     5     5 orange2 orange1
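
A key built by pasting the sorted pair together needs a separator once values can have more than one digit. A small base R sketch of why, where make_key is a hypothetical helper used only for this illustration:

```r
# Hypothetical helper: build a string key from an unordered pair
make_key <- function(a, b, sep) paste(sort(c(a, b)), collapse = sep)

# Without a separator, two different pairs produce the same key
collide_no_sep <- make_key(1, 234, "") == make_key(12, 34, "")

# With a separator, the keys stay distinct
collide_sep <- make_key(1, 234, "_") == make_key(12, 34, "_")
```

With the six rows in the question the two variants give the same result, so this only matters for wider value ranges.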

If the data is large, as in "Sorting large amounts of data and storing duplicate pairs of values in R", using apply() on each row will be expensive. Instead, create a set of unique values

uid = unique(unlist(testdata[c("Var1", "Var2")], use.names=FALSE))

determine if a swap is needed

swap = match(testdata[["Var1"]], uid) > match(testdata[["Var2"]], uid)

and update

tmp = testdata[swap, "Var1"]
testdata[swap, "Var1"] = testdata[swap, "Var2"]
testdata[swap, "Var2"] = tmp

and then remove the duplicates as before

testdata[!duplicated(testdata[1:2]),]

If there were many additional columns and copying them was expensive, a more self-contained solution would be

uid = unique(unlist(testdata[c("Var1", "Var2")], use.names=FALSE))
swap = match(testdata[["Var1"]], uid) > match(testdata[["Var2"]], uid)
idx = !duplicated(data.frame(
    V1 = ifelse(swap, testdata[["Var2"]], testdata[["Var1"]]),
    V2 = ifelse(swap, testdata[["Var1"]], testdata[["Var2"]])))
testdata[idx, , drop=FALSE]
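
Because match() against uid ranks values by first appearance rather than by numeric order, the same steps should also work for non-numeric keys. A minimal sketch under that assumption, wrapped in a hypothetical helper dedupe_pairs() and run on a made-up edges data frame (both names are illustrative, not from the answer):

```r
# Hypothetical helper wrapping the steps above; a and b name the pair columns
dedupe_pairs <- function(df, a = "Var1", b = "Var2") {
  # Rank every value by first appearance to get a consistent ordering
  uid <- unique(unlist(df[c(a, b)], use.names = FALSE))
  swap <- match(df[[a]], uid) > match(df[[b]], uid)
  # Build the ordered pair separately, leaving df itself unmodified
  key <- data.frame(
    V1 = ifelse(swap, df[[b]], df[[a]]),
    V2 = ifelse(swap, df[[a]], df[[b]])
  )
  df[!duplicated(key), , drop = FALSE]
}

# Made-up example data with character keys
edges <- data.frame(
  from = c("x", "y", "z", "y"),
  to   = c("y", "x", "x", "z"),
  w    = 1:4,
  stringsAsFactors = FALSE
)
result <- dedupe_pairs(edges, "from", "to")
```

Here the second row (y, x) is dropped as a repeat of (x, y), while the remaining rows keep their original column order. Factor columns would need converting to character first, since ifelse() does not preserve factor levels.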