Disjoint values in R

Question


I have two data sets of at least 420,500 observations each. For example:

    dataset1 <- data.frame(col1 = c("microsoft", "apple", "vmware", "delta", "microsoft"),
                           col2 = paste0(c("a", "b", "c", 4, "asd"), ".exe"),
                           col3 = c(2, 1, 3, 4, 5))
    dataset2 <- data.frame(col1 = c("apple", "cisco", "vmware", "delta", "microsoft"),
                           col2 = paste0(c("a", "b", "d", 5, "asd"), ".exe"),
                           col3 = c(3, 4, 1, 5, 2))

    > dataset1
           col1    col2 col3
    1 microsoft   a.exe    2
    2     apple   b.exe    1
    3    vmware   c.exe    3
    4     delta   4.exe    4
    5 microsoft asd.exe    5
    > dataset2
           col1    col2 col3
    1     apple   a.exe    3
    2     cisco   b.exe    4
    3    vmware   d.exe    1
    4     delta   5.exe    5
    5 microsoft asd.exe    2

I would like to print all the observations in dataset1 that do not overlap with any observation in dataset2, comparing both col1 and col2 in each. In this case that means printing everything except the last observation: observations 1 and 2 match on col2 but not col1, and observations 3 and 4 match on col1 but not col2. That is:

            col1  col2 col3
    1:     apple b.exe    1
    2:     delta 4.exe    4
    3: microsoft a.exe    2
    4:    vmware c.exe    3


r intersection

vardha Jul 13 '15 at 17:31


2 answers

akrun · Answer 1 · 2015-07-13T17:43:29+0000

You can use anti_join from dplyr:

    library(dplyr)
    anti_join(df1, df2, by = c('col1', 'col2'))
    #       col1  col2       col3
    #1     delta 4.exe -0.5836272
    #2    vmware c.exe  0.4196231
    #3     apple b.exe  0.5365853
    #4 microsoft a.exe -0.5458808

data

    set.seed(24)
    df1 <- data.frame(col1 = c('microsoft', 'apple', 'vmware', 'delta', 'microsoft'),
                      col2 = c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
                      col3 = rnorm(5), stringsAsFactors = FALSE)
    set.seed(22)
    df2 <- data.frame(col1 = c('apple', 'cisco', 'proactive', 'dtex', 'microsoft'),
                      col2 = c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
                      col3 = rnorm(5), stringsAsFactors = FALSE)
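For readers who cannot install dplyr, the row filter that anti_join performs here can be approximated in base R. This is only a sketch of the matching logic, not how dplyr implements it. The fixed col3 values and the "\r" separator are assumptions made for illustration, and the names key1, key2 and only_in_df1 are hypothetical.

```r
# Toy data with the same key columns as df1/df2 above; col3 is replaced by
# fixed integers (an assumption, so that the result is reproducible).
df1 <- data.frame(
  col1 = c("microsoft", "apple", "vmware", "delta", "microsoft"),
  col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
  col3 = 1:5,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  col1 = c("apple", "cisco", "proactive", "dtex", "microsoft"),
  col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
  col3 = 5:1,
  stringsAsFactors = FALSE
)

# Build one composite key per row. "\r" is an arbitrary separator that is
# assumed not to occur in either column.
key1 <- paste(df1$col1, df1$col2, sep = "\r")
key2 <- paste(df2$col1, df2$col2, sep = "\r")

# Keep the rows of df1 whose (col1, col2) pair never appears in df2.
only_in_df1 <- df1[!(key1 %in% key2), ]
print(only_in_df1)
```

The rows come back in the original order of df1, which may differ from the order in which anti_join prints them.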

MichaelChirico · Answer 2 · 2015-07-13T17:46:41+0000

A data.table solution inspired by this:

    library(data.table) #1.9.5+
    setDT(dataset1, key = c("col1", "col2"))
    setDT(dataset2, key = key(dataset1))
    dataset1[!dataset2]
    #         col1  col2 col3
    # 1:     apple b.exe    1
    # 2:     delta 4.exe    4
    # 3: microsoft a.exe    2
    # 4:    vmware c.exe    3

You can also try without a key:

    library(data.table) #1.9.5+
    setDT(dataset1); setDT(dataset2)
    dataset1[!dataset2, on = c("col1", "col2")]
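One practical difference between the two variants is row order: setting a key sorts the table by the key columns, while the unkeyed form leaves the rows as they were. The sketch below illustrates this using the values printed in the question. It assumes a data.table version that supports the on= argument, and make_dataset1 is a hypothetical helper used only to get two independent copies of the table.

```r
library(data.table)

# Hypothetical helper: returns a fresh copy of the question's dataset1,
# with col3 taken from the printed output in the question.
make_dataset1 <- function() {
  data.table(
    col1 = c("microsoft", "apple", "vmware", "delta", "microsoft"),
    col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
    col3 = c(2, 1, 3, 4, 5)
  )
}
dataset2 <- data.table(
  col1 = c("apple", "cisco", "vmware", "delta", "microsoft"),
  col2 = c("a.exe", "b.exe", "d.exe", "5.exe", "asd.exe"),
  col3 = c(3, 4, 1, 5, 2)
)

# Keyed variant: setkey() physically sorts the table by col1, then col2.
keyed <- make_dataset1()
setkey(keyed, col1, col2)
keyed_result <- keyed[!dataset2, on = c("col1", "col2")]

# Unkeyed variant: the rows keep the order they had in the input.
unkeyed <- make_dataset1()
unkeyed_result <- unkeyed[!dataset2, on = c("col1", "col2")]

print(keyed_result$col1)    # apple, delta, microsoft, vmware
print(unkeyed_result$col1)  # microsoft, apple, vmware, delta
```

Both variants return the same four rows; only the ordering differs.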
