R: data.table column names not found inside a function

I am trying to use data.table inside a function, and I am trying to understand why my code is not working. I have a data.table as follows:

    DT <- data.table(my_name=c("A","B","C","D","E","F"),my_id=c(2,2,3,3,4,4))
    > DT
       my_name my_id
    1:       A     2
    2:       B     2
    3:       C     3
    4:       D     3
    5:       E     4
    6:       F     4

I am trying to create all pairs of "my_name" with different values of "my_id", which for DT would be:

    Var1 Var2
    A    C
    A    D
    A    E
    A    F
    B    C
    B    D
    B    E
    B    F
    C    E
    C    F
    D    E
    D    F

I have a function to return all pairs of "my_name" for a given pair of "my_id" values, which works as expected.

    get_pairs <- function(id1,id2,tdt) {
      return(expand.grid(tdt[my_id==id1,my_name],tdt[my_id==id2,my_name]))
    }
    > get_pairs(2,3,DT)
      Var1 Var2
    1    A    C
    2    B    C
    3    A    D
    4    B    D

Now I want to run this function for every pair of ids. I am trying to do this by finding all pairs of ids and then using mapply with the get_pairs function.

    > combn(unique(DT$my_id),2)
         [,1] [,2] [,3]
    [1,]    2    2    3
    [2,]    3    4    4
    tid1 <- combn(unique(DT$my_id),2)[1,]
    tid2 <- combn(unique(DT$my_id),2)[2,]
    mapply(get_pairs, tid1, tid2, DT)
    Error in expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name]) :
      object 'my_id' not found

Again, if I try to do the same without mapply, it works.

    > get_pairs(tid1[1],tid2[1],DT)
      Var1 Var2
    1    A    C
    2    B    C
    3    A    D
    4    B    D

Why does the function fail only when used in mapply? I think this has something to do with the scoping of data.table column names, but I'm not sure.

Alternatively, is there a different or more efficient way to accomplish this task? I have a large data.table with a third identifier, "sample", and I need to get all these pairs for each sample (for example, operating on DT[sample == "sample_id",]). I am new to the data.table package and may not be using it in the most efficient way.

r data.table mapply
3 answers

Why does the function fail only when used in mapply? I think this has something to do with the scoping of data.table column names, but I'm not sure.

The reason the function fails in this case has nothing to do with scoping. mapply vectorizes a function: it takes each element of each argument in turn and passes it to the function. A data.table is a list whose elements are its columns, so mapply passes the column my_name instead of the full data.table.
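The behavior described above can be checked with a small sketch. The probe function and the shortened id vectors here are hypothetical, introduced only for illustration; the probe simply reports the class of whatever arrives as tdt.

```r
library(data.table)

DT <- data.table(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

# A data.table is a list with one element per column.
is.list(DT)
length(DT)

# Hypothetical probe (not from the question): report the class of `tdt`.
probe <- function(id1, id2, tdt) class(tdt)[1]

# DT passed in `...`: mapply walks over its columns, one per call.
# Two ids are used so the lengths match the two columns of DT.
by_column <- mapply(probe, c(2, 3), c(3, 4), DT)

# DT passed through MoreArgs: every call receives the whole table.
whole_table <- mapply(probe, c(2, 3), c(3, 4), MoreArgs = list(tdt = DT))

print(by_column)
print(whole_table)
```

In the first call the function sees a character vector and then a numeric vector; in the second it sees a data.table both times.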

If you want to pass the full data.table to mapply, you should use the MoreArgs parameter. Then your function will work:

    res <- mapply(get_pairs, tid1, tid2, MoreArgs = list(tdt=DT), SIMPLIFY = FALSE)
    do.call("rbind", res)
       Var1 Var2
    1     A    C
    2     B    C
    3     A    D
    4     B    D
    5     A    E
    6     B    E
    7     A    F
    8     B    F
    9     C    E
    10    D    E
    11    C    F
    12    D    F

List all possible pairs

    u_name <- unique(DT$my_name)
    all_pairs <- CJ(u_name,u_name)[V1 < V2]

List observed pairs

    obs_pairs <- unique(
      DT[,{un <- unique(my_name); CJ(un,un)[V1 < V2]}, by=my_id][, !"my_id", with=FALSE]
    )

Take the difference

    all_pairs[!J(obs_pairs)]

CJ is similar to expand.grid, except that it creates a data.table keyed on all of its columns. A data.table X must have a key in order to do a join X[J(Y)] or a not-join X[!J(Y)] (as in the last line). J is optional, but makes it more obvious that we are doing a join.
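The three steps can be put together as a self-contained sketch. Two details here are assumptions rather than part of the answer: the CJ columns are named V1 and V2 explicitly (the default column names of CJ may differ between data.table versions), and the not-join is written with the on= argument instead of relying on keys and J().

```r
library(data.table)

DT <- data.table(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

# All possible unordered pairs of names (V1 < V2 drops self-pairs and mirrors).
u_name <- unique(DT$my_name)
all_pairs <- CJ(V1 = u_name, V2 = u_name)[V1 < V2]

# Pairs that share an id, built group by group and then stripped of my_id.
obs_pairs <- unique(
  DT[, {un <- unique(my_name); CJ(V1 = un, V2 = un)[V1 < V2]}, by = my_id][, .(V1, V2)]
)

# Not-join: keep the rows of all_pairs that have no match in obs_pairs.
res <- all_pairs[!obs_pairs, on = c("V1", "V2")]
print(res)
```

For the example data this leaves the 12 cross-id pairs listed in the question, since the three same-id pairs (A-B, C-D, E-F) are removed from the 15 possible ones.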


Simplification. @CathG pointed out that there is a cleaner way to build obs_pairs if you always have two sorted "names" for each "id" (as in the example data): use as.list(un) instead of CJ(un,un)[V1 < V2].
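A minimal sketch of that simplification, valid only under the stated assumption of exactly two sorted names per id. It relies on data.table naming the unnamed list elements returned from j as V1 and V2.

```r
library(data.table)

DT <- data.table(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

# With exactly two names per id, as.list() spreads them into two columns,
# giving one row (one observed pair) per id.
obs_pairs <- DT[, as.list(unique(my_name)), by = my_id][, .(V1, V2)]
print(obs_pairs)
```

If some id had more or fewer than two names, the groups would return lists of different lengths, so the CJ version is the safer general form.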


The debugonce() function is extremely useful in these scenarios.

    debugonce(mapply)
    mapply(get_pairs, tid1, tid2, DT)
    # Hit enter twice
    # from within BROWSER
    debugonce(FUN)
    # Hit enter twice
    # you'll be inside your function; then type tdt (the function's data argument)
    tdt
    # [1] "A" "B" "C" "D" "E" "F"
    Q # (to quit debugging mode)

which is wrong: the function received a single column rather than the whole table. Basically, mapply() takes the first element of each input argument and passes it to your function, then the second element of each, and so on. In this case, you have provided a data.table, which is also a list. Thus, instead of passing the entire data.table, it passes each list element (each column) in turn.

So you can get around this by doing:

 mapply(get_pairs, tid1, tid2, list(DT)) 

But mapply() simplifies the result by default, so you get a matrix back. You will need to use SIMPLIFY = FALSE.

 mapply(get_pairs, tid1, tid2, list(DT), SIMPLIFY = FALSE) 

Or just use Map:

 Map(get_pairs, tid1, tid2, list(DT)) 

Use rbindlist() to bind the results.
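Putting these pieces together, a self-contained sketch might look like the following. It reuses get_pairs from the question; the variable names ids, simplified, res and pairs are hypothetical and used only for illustration.

```r
library(data.table)

DT <- data.table(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

# get_pairs as defined in the question.
get_pairs <- function(id1, id2, tdt) {
  expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name])
}

# One column per pair of ids: row 1 holds the first id, row 2 the second.
ids <- combn(unique(DT$my_id), 2)

# list(DT) has length 1, so the whole table is recycled to every call.
# With the default simplification the data.frames collapse into a matrix.
simplified <- mapply(get_pairs, ids[1, ], ids[2, ], list(DT))
is.matrix(simplified)

# Map() never simplifies, so it returns a plain list of data.frames.
res <- Map(get_pairs, ids[1, ], ids[2, ], list(DT))

# Stack the per-pair results into one data.table.
pairs <- rbindlist(res)
print(pairs)
```

Wrapping the table in list() and passing it through MoreArgs should give the same result here; list(DT) is just shorter to type.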

HTH
