Is it possible to use the data.table index-join-assign idiom to do a left join and assign NAs in non-matching rows from i to x?

Yesterday I gave this answer: Match data tables with five columns to change the value in another column.

In the comments, the OP asked whether we can effectively perform a left join of the two tables, so that the NAs produced for unmatched rows of the right table are assigned into the left table. It seems to me that data.table does not provide any facility for this.

Here is the example that I used in that question:

    library(data.table);
    set.seed(1L);
    dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
    dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
    dt1;
    ##     id V1 V2      blah1
    ##  1:  1  1  1 -0.6264538
    ##  2:  2  2  1  0.1836433
    ##  3:  3  3  1 -0.8356286
    ##  4:  4  1  2  1.5952808
    ##  5:  5  2  2  0.3295078
    ##  6:  6  3  2 -0.8204684
    ##  7:  7  1  3  0.4874291
    ##  8:  8  2  3  0.7383247
    ##  9:  9  3  3  0.5757814
    ## 10: 10  1  4 -0.3053884
    ## 11: 11  2  4  1.5117812
    ## 12: 12  3  4  0.3898432
    dt2;
    ##    id V1 V2       blah2
    ## 1: 13  1  1 -0.62124058
    ## 2: 14  2  1 -2.21469989
    ## 3: 15  1  2  1.12493092
    ## 4: 16  2  2 -0.04493361
    ## 5: 17  1  3 -0.01619026
    ## 6: 18  2  3  0.94383621
    key <- paste0('V',1:2);

And here is the solution I gave, which does not produce NA for the non-matching rows:

    dt1[dt2,on=key,id:=i.id];
    dt1;
    ##     id V1 V2      blah1
    ##  1: 13  1  1 -0.6264538
    ##  2: 14  2  1  0.1836433
    ##  3:  3  3  1 -0.8356286
    ##  4: 15  1  2  1.5952808
    ##  5: 16  2  2  0.3295078
    ##  6:  6  3  2 -0.8204684
    ##  7: 17  1  3  0.4874291
    ##  8: 18  2  3  0.7383247
    ##  9:  9  3  3  0.5757814
    ## 10: 10  1  4 -0.3053884
    ## 11: 11  2  4  1.5117812
    ## 12: 12  3  4  0.3898432

We need the id values of 12 and below that remain in dt1 to be replaced by NA (not because they are 12 or below, and not because those id values are absent from dt2, but because the join on the key columns, V1 and V2, finds no match in dt2 for those rows of dt1).

As I said in the comments on that question, a workaround is to first assign NA to all of dt1$id and then run the index-join-assign operation. This gives the expected result:

    dt1$id <- NA;
    dt1[dt2,on=key,id:=i.id];
    dt1;
    ##     id V1 V2      blah1
    ##  1: 13  1  1 -0.6264538
    ##  2: 14  2  1  0.1836433
    ##  3: NA  3  1 -0.8356286
    ##  4: 15  1  2  1.5952808
    ##  5: 16  2  2  0.3295078
    ##  6: NA  3  2 -0.8204684
    ##  7: 17  1  3  0.4874291
    ##  8: 18  2  3  0.7383247
    ##  9: NA  3  3  0.5757814
    ## 10: NA  1  4 -0.3053884
    ## 11: NA  2  4  1.5117812
    ## 12: NA  3  4  0.3898432

I think the workaround is fine, but I'm not sure why data.table seems unable to do this in one shot with a single index-join-assign operation. Below are three dead ends that I explored:

1: nomatch

data.table provides the nomatch argument, which is a bit like the all, all.x and all.y arguments of merge(). It is actually a very limited argument: it only lets you switch from a right join (nomatch=NA, the default) to an inner join (nomatch=0). We cannot get a left join with it.
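To make the nomatch limitation concrete, here is a minimal sketch. The probe table is a hypothetical addition (it is not in the original example), and the random blah columns are left out for brevity.

```r
library(data.table)

dt1 <- data.table(id = 1:12, expand.grid(V1 = 1:3, V2 = 1:4))
dt2 <- data.table(id = 13:18, expand.grid(V1 = 1:2, V2 = 1:3))
key <- c("V1", "V2")

# Hypothetical probe table: its second row has no partner in dt1.
probe <- data.table(V1 = c(1L, 9L), V2 = c(1L, 1L))

right_join <- dt1[probe, on = key]               # default nomatch = NA
inner_join <- dt1[probe, on = key, nomatch = 0]  # unmatched probe row dropped

nrow(right_join)  # 2
nrow(inner_join)  # 1

# Either way the result is driven by the rows of i. With dt2 as i, the six
# rows of dt1 that have no partner in dt2 never appear at all.
nrow(dt1[dt2, on = key])               # 6
nrow(dt1[dt2, on = key, nomatch = 0])  # 6
```

In both settings the rows of x that have no match are simply absent from the result, which is why nomatch alone cannot produce a left join with respect to dt1.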

2: flip dt1 and dt2

Since dt1[dt2] is a right join, we can just flip it, i.e. dt2[dt1], to get the corresponding left join.

This will not work, because we need to use the := syntax in the j argument to assign into dt1, and under the flipped call we would be assigning into dt2 instead. I tried assigning to i.id under the flipped call, but this did not affect the original dt1.
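The flipped call can be sketched as follows. The second half is a variation that is not from the original post: it nests the flipped join inside j, and it assumes every (V1, V2) pair appears at most once in dt2, so the lookup returns exactly one value per row of dt1. The random blah columns are omitted for brevity.

```r
library(data.table)

dt1 <- data.table(id = 1:12, expand.grid(V1 = 1:3, V2 = 1:4))
dt2 <- data.table(id = 13:18, expand.grid(V1 = 1:2, V2 = 1:3))
key <- c("V1", "V2")

# Flipped assignment: := always writes into x, which is now dt2.
dt2_flipped <- copy(dt2)
dt2_flipped[dt1, on = key, id := i.id]
dt2_flipped$id  # 1 2 4 5 7 8  (dt1 is left untouched)

# Variation (assumption: keys are unique in dt2): use the flipped join only
# as a lookup that returns dt2's id for every row of dt1, NA where there is
# no match, and assign that vector into dt1.
dt1[, id := dt2[dt1, on = key, x.id]]
dt1$id  # 13 14 NA 15 16 NA 17 18 NA NA NA NA
```

If dt2 could contain duplicate key combinations, the lookup would return more values than dt1 has rows, so this sketch would need adjusting (for example with mult = "first").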

3: use merge.data.table()

We can call merge.data.table() with the argument all.x=TRUE to get a left join. The problem now is that merge.data.table() has no j argument, so it provides no means of assigning a column into the left (or right) table.
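A small sketch of what the merge route does give you, with the random blah columns omitted for brevity: the left join arrives as a new table, and dt1 itself is not modified.

```r
library(data.table)

dt1 <- data.table(id = 1:12, expand.grid(V1 = 1:3, V2 = 1:4))
dt2 <- data.table(id = 13:18, expand.grid(V1 = 1:2, V2 = 1:3))
key <- c("V1", "V2")

# Left join via merge(): dt1's own id is dropped first so that the only id
# column in the result is the one looked up from dt2.
merged <- merge(dt1[, .(V1, V2)], dt2, by = key, all.x = TRUE)

nrow(merged)           # 12
sum(is.na(merged$id))  # 6

# merge() returns a new data.table (sorted by the key columns by default);
# the original dt1 still holds its old id values.
dt1$id  # 1 2 ... 12
```

Getting the result "into" dt1 this way means rebinding the name, e.g. dt1 <- merged, rather than updating by reference.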


So, is it possible to perform this operation at all with data.table? And if so, what is the best way to do this?

r left-join data.table assign
1 answer

AFAIU you just want to look up the id column from dt2 into dt1. The original id variable in dt1 does not seem to play any part in the process when you join on V1, V2, and you don't want the dt1$id values in the result. So the technically correct way to handle this is to not use that column at all.

    set.seed(1)
    library(data.table)
    dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
    dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
    on = paste0("V",1:2) # I rename to `on` to not mask `key` function
    dt1[,id:=NULL
        ][dt2,on=on,id:=i.id
          ][]
    #     V1 V2      blah1 id
    #  1:  1  1 -0.6264538 13
    #  2:  2  1  0.1836433 14
    #  3:  3  1 -0.8356286 NA
    #  4:  1  2  1.5952808 15
    #  5:  2  2  0.3295078 16
    #  6:  3  2 -0.8204684 NA
    #  7:  1  3  0.4874291 17
    #  8:  2  3  0.7383247 18
    #  9:  3  3  0.5757814 NA
    # 10:  1  4 -0.3053884 NA
    # 11:  2  4  1.5117812 NA
    # 12:  3  4  0.3898432 NA

Aside from the question:
- you do not need ; at the end of a line that holds only one expression
- use dt1[, id := NA_integer_] instead of dt1$id <- NA
- use set.seed when providing code that calls rnorm or other randomness-related functions
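For illustration, the NA_integer_ suggestion can be chained with the join-assign like this. It is a sketch of the pre-assign workaround in data.table syntax, with the random blah columns omitted for brevity.

```r
library(data.table)

dt1 <- data.table(id = 1:12, expand.grid(V1 = 1:3, V2 = 1:4))
dt2 <- data.table(id = 13:18, expand.grid(V1 = 1:2, V2 = 1:3))
on <- c("V1", "V2")

# Blank the column by reference with a typed NA, then fill in the matches.
dt1[, id := NA_integer_][dt2, on = on, id := i.id]

class(dt1$id)  # "integer"
dt1$id         # 13 14 NA 15 16 NA 17 18 NA NA NA NA
```

Using NA_integer_ keeps id an integer column throughout, so the join-assign writes integers into an integer column.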
