Subset rules

Having df1 and df2 as follows:

 df1 <- read.table(text = "
 x y z
 1 1 1
 1 2 1
 1 1 2
 2 1 1
 2 2 2", header = TRUE)

 df2 <- read.table(text = "
 a b c
 1 1 1
 1 2 8
 1 1 2
 2 6 2", header = TRUE)

I can query the data, for example:

  df2[ df2$b == 6 | df2$c == 8 ,] # any rows where b = 6 or c = 8 in df2
  # and additive conditions
  df2[ df2$b == 6 & df2$c == 8 ,] # zero rows

and between data frames:

  df1[ df1$z %in% df2$c ,] # rows in df1 where values in z are in c (all rows)

This gives me all the rows:

  df1[ (df1$x %in% df2$a) & (df1$y %in% df2$b) & (df1$z %in% df2$c) ,] 

but shouldn't this give me all rows of df1 too?

  df1[ df1$z %in% df2$c | df1$b == 9,] 

What I really want to do is subset df1 by df2 on three columns, so that I only get the rows of df1 where x, y, z equal a, b, c all at once, within the same row. In my real data I will have more than three columns, but I will still want to subset on three additive column conditions.

So, subsetting my example df1 by df2, the result I want is:

 df1
 1 1 1
 1 1 2

Playing with the syntax has only left me more confused, and the SO posts that should cover what I want are actually adding to my confusion.

I realized that I can do this:

  merge(df1,df2, by.x=c("x","y","z"),by.y=c("a","b","c")) 

which gives me what I want, but I would like to understand why my attempts with [ are wrong.

1 answer

Besides your nice solution using merge (thanks for that, I always forget merge), this can be done in base R using ?interaction as follows. There may be other ways to do this, but this is the one I am familiar with:

 > df1[interaction(df1) %in% interaction(df2), ] 
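
To make the one-liner above easier to follow, here is a minimal sketch using the data from the question. The names key1, key2, keep and res are only for illustration. It builds one label per row and then compares the labels:

```r
df1 <- read.table(text = "
x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)

df2 <- read.table(text = "
a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)

# interaction() combines the columns of each row into one label, e.g. "1.1.1".
# Two rows get the same label only if they agree in every column.
key1 <- as.character(interaction(df1))
key2 <- as.character(interaction(df2))

# Row-wise membership test: is the whole row of df1 present in df2?
keep <- key1 %in% key2
res <- df1[keep, ]
print(res)
```

One thing to keep in mind, as far as I can tell: the labels are joined with ".", so values that themselves contain a dot (such as 1.1) could in principle produce ambiguous labels. With integer columns like the ones here that should not be an issue.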

Now, to answer your question: firstly, I think there is a typo (fixed) in:

 df1[ df1$z %in% df2$c | df2$b == 9,] # second part should be df2$b == 9 

You would get a warning because the first part evaluates to:

 [1] TRUE TRUE TRUE TRUE TRUE 

and the second evaluates to:

 [1] FALSE FALSE FALSE FALSE 

You are applying | to vectors of unequal length, which gives the message:

 longer object length is not a multiple of shorter object length 
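
As for why the attempts with [ in the question do not do a row-wise match, the following sketch (same data as in the question; variable names such as colwise and rowwise are just for illustration) may help. Each %in% checks one column on its own, so a row of df1 passes as soon as each of its values occurs somewhere in the corresponding column of df2, not necessarily in the same row. It also shows what happens with the original df1$b, a column that does not exist in df1:

```r
df1 <- read.table(text = "
x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)

df2 <- read.table(text = "
a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)

# Column-wise test: each column is checked independently of the others.
colwise <- (df1$x %in% df2$a) & (df1$y %in% df2$b) & (df1$z %in% df2$c)
print(colwise)   # every row of df1 passes

# Row 2 of df1 is (1, 2, 1). Each value occurs somewhere in df2,
# but no single row of df2 is (1, 2, 1), so merge() drops it.
rowwise <- merge(df1, df2, by.x = c("x", "y", "z"), by.y = c("a", "b", "c"))
print(rowwise)

# df1 has no column b, so df1$b is NULL and the comparison has length 0.
# Combining a length-5 vector with a length-0 vector via | gives length 0,
# so the subset selects no rows at all.
no_col <- df1$z %in% df2$c | df1$b == 9
print(length(no_col))
print(nrow(df1[no_col, ]))
```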

Edit: If you have more columns, you can choose which ones go into the interaction. For example, to get the rows of df1 whose first two columns match the first two columns of some row in df2, you can simply do:

 > df1[interaction(df1[, 1:2]) %in% interaction(df2[, 1:2]), ] 
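
The same idea can be written with column names instead of positions, which may be more robust when the real data has many columns. This is only a sketch on the data from the question; cols1, cols2, keep2 and res2 are names I made up for the example:

```r
df1 <- read.table(text = "
x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2", header = TRUE)

df2 <- read.table(text = "
a b c
1 1 1
1 2 8
1 1 2
2 6 2", header = TRUE)

# Columns to match on, listed in corresponding order for each data frame.
cols1 <- c("x", "y")
cols2 <- c("a", "b")

# Build row labels from the chosen columns only, then compare them.
keep2 <- as.character(interaction(df1[, cols1])) %in%
  as.character(interaction(df2[, cols2]))
res2 <- df1[keep2, ]
print(res2)
```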