Can I use the join capability of R's data.table to select rows and perform some operation?

Question


I'm not sure how to get the row indices that result from joining two data.tables.

To set up a simplified example, suppose dt is a data.table with column "a", which holds a letter of the alphabet, and column "b", which holds some other information.

I want to add a column "c" and set it to either "vowel" or "consonant" depending on column "a". I have another data.table, dtv, that serves as a table of vowels. Can I use data.table's join capability to do this efficiently?

 require(data.table)
 dt <- data.table(a = sample(letters, 25, replace = T),
                  b = sample(50:100, 25, replace = F))
 dtv <- data.table(vowel = c('a','e','i','o','u'))
 setkey(dt, a)

The following line of code gives me a data.table containing the rows of dt with vowels

 dt[dtv, nomatch=0]

But how can I capture the row indices so that I can label each row as a vowel or a consonant?

 dt[, c := 'consonant']
 dt[{ `a` found in vowel list }, c := 'vowel'] # I want to do this where column 'a' is a vowel


join r data.table

Kerry Dec 7 '15 at 1:08


2 answers

Actually, there is no need to use a merge/join here. We can use %in%.

 dt[, c := "consonant"]
 dt[a %in% dtv$vowel, c := "vowel"]

Or the same thing in a single line, by chaining:

 dt[, c := "consonant"][a %in% dtv$vowel, c := "vowel"]

Alternatively (and better), we can complete both of these steps in the same call with the following.

 dt[, c := c("consonant", "vowel")[a %in% dtv$vowel + 1L]]
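The one-call version works because `%in%` binds more tightly than `+`, so the logical result is converted to an index of 1 or 2 before it is used to pick a label. A minimal base R sketch of that step, using toy vectors (`a` and `vowels` here are stand-ins for `dt$a` and `dtv$vowel`, not the data from the question):

```r
# Toy inputs standing in for dt$a and dtv$vowel (assumed values for illustration)
a <- c("b", "e", "k", "o", "z")
vowels <- c("a", "e", "i", "o", "u")

# %in% is evaluated before +, so this is (a %in% vowels) + 1L
is_vowel <- a %in% vowels     # logical vector, TRUE where the letter is a vowel
idx <- is_vowel + 1L          # FALSE becomes 1, TRUE becomes 2

# Index into the two-element label vector: 1 -> "consonant", 2 -> "vowel"
labels <- c("consonant", "vowel")[idx]
print(labels)
```

The same expression inside `dt[, c := ...]` fills the whole column in one pass, with no separate default assignment.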


Rich Scriven Dec 7 '15 at 1:19


David Arenburg · Accepted Answer · 2015-12-07T20:40:15+0000

Since v1.9.4, data.table is optimized to use a binary join for %in% if the data set is already keyed. Therefore, @Richard's answer should have the same performance in recent versions of data.table (by the way, %in% had a bug when used with datatable.auto.index = TRUE, so please make sure you have data.table v1.9.6+ installed if you are going to use it).

The following illustrates data.table using a binary join when %in% is used

 require(data.table)
 set.seed(123)
 dt <- data.table(a = sample(letters, 25, replace = T),
                  b = sample(50:100, 25, replace = F))
 dtv <- data.table(vowel = c('a','e','i','o','u'))
 setkey(dt, a)
 options(datatable.verbose = TRUE)
 dt[a %in% dtv$vowel]
 # Starting bmerge ...done in 0 secs <~~~ binary join was triggered
 #    a  b
 # 1: i 87
 # 2: o 84
 # 3: o 62
 # 4: u 77

Anyway, you were almost there, and you can easily modify c while joining

 dt[, c := 'consonant']
 dt[dtv, c := 'vowel']

Or, if you want to avoid joining unnecessary columns from dtv (if present), you can join on just the first column of dtv

 dt[dtv$vowel, c := 'vowel']

Please note that I have not used .() or J(). By default, data.table performs a binary join rather than row indexing if the i argument is not of type integer or numeric. This matters if, for example, you want to perform a binary join on column b (which is of type integer). Compare

 setkey(dt, b)
 dt[80:85]
 #     a  b <~~~ binary join wasn't triggered, instead an attempt to subset by rows 80:85 was made
 # 1: NA NA
 # 2: NA NA
 # 3: NA NA
 # 4: NA NA
 # 5: NA NA
 # 6: NA NA

and

 dt[.(80:85)] # or dt[J(80:85)]
 # Starting bmerge ...done in 0 secs <~~~ binary join was triggered
 #     a  b
 # 1:  x 80
 # 2:  x 81
 # 3: NA 82
 # 4: NA 83
 # 5:  o 84
 # 6: NA 85

Another difference between the two methods is that %in% will not return unmatched instances. Compare

 setkey(dt, a)
 dt[a %in% dtv$vowel]
 # Starting bmerge ...done in 0 secs
 #    a  b
 # 1: i 87
 # 2: o 84
 # 3: o 62
 # 4: u 77

and

 dt[dtv$vowel]
 # Starting bmerge ...done in 0 secs
 #    a  b
 # 1: a NA <~~~ unmatched values returned
 # 2: e NA <~~~ unmatched values returned
 # 3: i 87
 # 4: o 84
 # 5: o 62
 # 6: u 77

In this particular case it does not matter, because := will not modify unmatched values, but you can use nomatch = 0L in other cases

 dt[dtv$vowel, nomatch = 0L]
 # Starting bmerge ...done in 0 secs
 #    a  b
 # 1: i 87
 # 2: o 84
 # 3: o 62
 # 4: u 77

Remember to set options(datatable.verbose = FALSE) if you don't want data.table to be so verbose.
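If the row indices themselves are needed, as the question originally asked, one option is the `which = TRUE` argument of `[.data.table`, which returns the matching row numbers of `dt` instead of the joined rows. This is a minimal sketch and not part of the answers above; it assumes the data.table package is installed and reuses the setup from the question:

```r
library(data.table)

set.seed(123)
dt <- data.table(a = sample(letters, 25, replace = TRUE),
                 b = sample(50:100, 25, replace = FALSE))
dtv <- data.table(vowel = c("a", "e", "i", "o", "u"))
setkey(dt, a)

# which = TRUE returns row numbers of dt rather than the joined rows;
# nomatch = 0L drops vowels that do not occur in dt
idx <- dt[dtv, which = TRUE, nomatch = 0L]

# Use the captured indices to label rows by reference
dt[, c := "consonant"]
dt[idx, c := "vowel"]
```

In practice, assigning directly in the join (`dt[dtv, c := 'vowel']`) avoids materializing the index vector, so this form is mainly useful when the indices are needed for something else as well.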
