Data.table subset NaN not working

Question


I have a column in a data table with NaN values. Sort of:

 library(data.table)
 my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3),
                     y = c(2, 4, 6, 8, 10, 12, 14))
 setkey(my.dt, x)

I can use the J() function to look up all rows where column x is .2:

 > my.dt[J(.2)]
      x  y
 1: 0.2 10
 2: 0.2 12

But if I try to do the same with NaN, it does not work:

 > my.dt[J(NaN)]
      x  y
 1: NaN NA

I would expect:

      x y
 1: NaN 2
 2: NaN 4
 3: NaN 6

What gives? I cannot find anything in the data.table documentation to explain why this is happening (although it may just be that I do not know what to look for). Is there a way to get what I want? Ultimately, I would like to replace all NaN values with zero using something like my.dt[J(NaN), x := 0]

+5

r data.table

Wilduck Oct 08 '13 at 2:07


3 answers

Here's a quick workaround that relies heavily on what is actually going on internally (which makes the code a bit fragile, imo). Since NaN is internally treated as a very, very negative number, it will always sort to the beginning of your data.table when you setkey. We can use this property to select such entries:

 # this will give the index of the first element that is *not* NaN
 my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T]

 # this is equivalent to my.dt[is.nan(x)], but much faster
 my.dt[seq_len(my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1)]

Here is a benchmark using Ricardo's test data:

 my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
 setnames(my.dt, 1, "ID")
 my.dt[sample(1e5, 1e3), ID := NA]
 setkey(my.dt, ID)

 # NOTE: I have to use integer max here - because this example has integers
 # instead of doubles, so I'll just add simple helper function (that would
 # likely need to be extended for other cases, but I'm just dealing with the ones here)
 minN = function(x) if (is.integer(x)) -.Machine$integer.max else -.Machine$double.xmax

 library(microbenchmark)
 microbenchmark(normalJ = my.dt[J(1)],
                naJ = my.dt[seq_len(my.dt[J(minN(ID)), roll = -Inf, which = T] - 1)])
 #Unit: milliseconds
 #    expr      min       lq   median       uq       max neval
 # normalJ 1.645442 1.864812 2.120577 2.863497  5.431828   100
 #     naJ 1.465806 1.689350 2.030425 2.600720 10.436934   100

In my tests, the following minN function also covers character and logical vectors:

 minN = function(x) {
   if (is.integer(x)) {
     -.Machine$integer.max
   } else if (is.numeric(x)) {
     -.Machine$double.xmax
   } else if (is.character(x)) {
     ""
   } else if (is.logical(x)) {
     FALSE
   } else {
     NA
   }
 }

You also need to add mult = 'first', for example:

 my.dt[seq_len(my.dt[J(minN(colname)), roll = -Inf, which = T, mult = 'first'] - 1)]

+3

eddi Oct 08 '13 at 16:06


See if this is helpful.

 my.dt[!is.finite(x),]
      x y
 1: NaN 2
 2: NaN 4
 3: NaN 6
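Note that the three predicates that could be used here are not interchangeable. A minimal base R sketch of the difference, using a hypothetical vector `v` that is not part of the original thread:

```r
# Hypothetical vector covering each special value; not taken from the thread
v <- c(NaN, NA, Inf, -Inf, 0.2)

is.nan(v)      # TRUE only for NaN
is.na(v)       # TRUE for NaN and NA
!is.finite(v)  # TRUE for NaN, NA, Inf and -Inf
```

For the data in the question all three select the same rows, since the column contains no NA or infinite values; on other data, is.nan is the narrowest choice.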

0

42- Oct 08 '13 at 3:10


Ricardo Saporta · Accepted Answer · 2013-10-08T05:00:44+0000

Update: This was fixed some time ago, in version 1.9.2. From NEWS:

NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.

 require(data.table) ## 1.9.2+
 my.dt[J(NaN)]
 #      x y
 # 1: NaN 2
 # 2: NaN 4
 # 3: NaN 6
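The ordering stated in the NEWS entry can be checked on a small keyed table. This is only a sketch, assuming data.table 1.9.2 or later; the table `dt` and its `id` column are hypothetical and not from the original post:

```r
library(data.table)

# Hypothetical toy table with one of each special value
dt <- data.table(x = c(0.5, NaN, -Inf, NA, Inf), id = 1:5)
setkey(dt, x)

# Per the NEWS entry (NA < NaN < -Inf), the keyed order should be
# NA, NaN, -Inf, 0.5, Inf
dt$x
dt$id
```

If the ordering holds, `dt$id` comes back as 4, 2, 3, 1, 5, with the NA row first and the NaN row second.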

This behavior is part design choice, part bug. There are several questions on SO and several mailing list threads that discuss NA handling in data.table.

The basic idea, described in the FAQ, is that NA is treated as FALSE.

Please feel free to join the conversation on the mailing list. There was a thread started by @Arun:

http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html

You can also learn more in the answers and comments to any of the following SO questions:

subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently

In the meantime, your best bet is to use is.na.
Although it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds:

 library(microbenchmark)
 microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
 # Unit: milliseconds
 #                expr    median
 #         my.dt[.(1)]  1.309948
 #    my.dt[is.na(ID)]  3.444689   <~~ Not bad
 #      my.dt[ID == 1]  4.005093
 #  my.dt[!(!(!(ID)))] 10.038134

 ### using the following for my.dt
 my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
 setnames(my.dt, 1, "ID")
 my.dt[sample(1e5, 1e3), ID := NA]
 setkey(my.dt, ID)
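Applied to the original goal of replacing every NaN with zero, the same vector-scan idea might look like the sketch below. It uses is.nan rather than is.na so that genuine NA values, if any, would be left alone; that substitution is an assumption about what the asker wants, not something stated in the thread. The table is left unkeyed here, since assigning into a key column is expected to drop the key anyway.

```r
library(data.table)

# Data from the question, without setkey()
my.dt <- data.table(x = c(NaN, NaN, NaN, .1, .2, .2, .3),
                    y = c(2, 4, 6, 8, 10, 12, 14))

# Select the NaN rows with a logical vector in i, then assign by reference in j
my.dt[is.nan(x), x := 0]
```

If the table needs to be keyed on x afterwards, calling setkey(my.dt, x) after the replacement restores the key.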
