fread() error with missing values in integer64 columns

When reading the text below, fread() fails to detect the missing values in columns 8 and 9. This happens only with the default option integer64="integer64". Setting integer64="double" or integer64="character" detects the NAs correctly. Note that the file contains three kinds of missing value in V8 and V9: an empty field (,,), a single space (, ,), and the literal NA. Adding na.strings=c("NA","N/A",""," "), sep="," as parameters has no effect.

Using read.csv() works, just like fread(integer64="double").

Text to read (also available as the file integer64_and_NA.csv):

    2012,276,,0,"S1","001",1,,724135215,1590915056,
    2012,276,2,8,"S1","001",1, ,,154598,0
    2012,276,2,12,"S1","001",1,NA,5118863,21819477,
    2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0

Here is the output from fread() :

    > DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",")
    Input contains no \n. Taking this to be a filename to open
    Detected eol as \r\n (CRLF) in that order, the Windows standard.
    Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok
    Found 11 columns
    First row with 11 fields occurs on line 1 (either column names or first row of data)
    Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
    Count of eol after first data row: 5
    Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows
    Type codes: 11114412221 (first 5 rows)
    Type codes: 11114412221 (after applying colClasses and integer64)
    Type codes: 11114412221 (after applying drop or select (if supplied)
    Allocating 11 column slots (11 - 0 NULL)
       0.000s (  0%) Memory map (rerun may be quicker)
       0.000s (  0%) sep and header detection
       0.000s (  0%) Count rows (wc -l)
       0.000s (  0%) Column type detection (first, middle and last 5 rows)
       0.000s (  0%) Allocation of 4x11 result (xMB) in RAM
       0.000s (  0%) Reading data
       0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
       0.000s (  0%) Coercing data already read in type bumps (if any)
       0.000s (  0%) Changing na.strings to NA
       0.001s        Total

The resulting data table:

    > DT
         V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11
    1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA
    2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0
    3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA
    4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0

NA values are displayed correctly in the columns that are not integer64. For V8 and V9, which fread() marked as integer64, we get "9218868437227407266" instead of NA. Interestingly, str() shows the corresponding values of V8 and V9 as NA:

    > str(DT)
    Classes 'data.table' and 'data.frame':  4 obs. of  11 variables:
     $ V1 : int  2012 2012 2012 2012
     $ V2 : int  276 276 276 276
     $ V3 : int  NA 2 2 2
     $ V4 : int  0 8 12 0
     $ V5 : chr  "S1" "S1" "S1" "S1"
     $ V6 : chr  "001" "001" "001" "011"
     $ V7 : int  1 1 1 8
     $ V8 :Class 'integer64'  num [1:4] NA NA NA 1.55e-314
     $ V9 :Class 'integer64'  num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314
     $ V10:Class 'integer64'  num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314
     $ V11: int  NA 0 NA 0
     - attr(*, ".internal.selfref")=<externalptr>

... but nothing else sees them as NA:

    > is.na(DT$V8)
    [1] FALSE FALSE FALSE FALSE
    > max(DT$V8)
    integer64
    [1] 9218868437227407266
    > max(DT$V8, na.rm=TRUE)
    integer64
    [1] 9218868437227407266
    > class(DT$V8)
    [1] "integer64"
    > typeof(DT$V8)
    [1] "double"

This does not seem to be merely a print/display issue in data.table; data.table really does see them as huge integers:

    > DT[, V12:=as.numeric(V8)]
    Warning message:
    In as.double.integer64(V8) :
      integer precision lost while converting to double
    > DT
         V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11          V12
    1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA 9.218868e+18
    2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0 9.218868e+18
    3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA 9.218868e+18
    4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0 3.127134e+09

Am I missing something about integer64, or is this a bug? As mentioned above, I can get by with integer64="double", possibly losing some precision, as noted in the help file. But this unexpected behavior occurs with the default, integer64="integer64" ...
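
A possible workaround that avoids the precision loss of integer64="double" is to read the wide-integer columns as character and convert them afterwards. The sketch below is only an illustration under a few assumptions: it relies on fread's text= argument, which exists only in data.table releases much newer than the ones discussed here, and to_int64 is a hypothetical helper, not part of data.table or bit64.

```r
library(data.table)
library(bit64)

# The four sample rows from the question, inlined so the example is self-contained.
csv_text <- paste(
  '2012,276,,0,"S1","001",1,,724135215,1590915056,',
  '2012,276,2,8,"S1","001",1, ,,154598,0',
  '2012,276,2,12,"S1","001",1,NA,5118863,21819477,',
  '2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0',
  sep = "\n"
)

# Read columns that would be detected as integer64 as character instead.
DT <- fread(text = csv_text, header = FALSE, sep = ",", integer64 = "character")

# Hypothetical helper: treat "", whitespace-only and "NA" fields as missing,
# then convert the remaining digit strings to integer64.
to_int64 <- function(x) {
  x <- trimws(as.character(x))
  x[is.na(x) | !nzchar(x) | x == "NA"] <- NA_character_
  as.integer64(x)
}

# Convert the three wide-integer columns in place.
cols <- c("V8", "V9", "V10")
DT[, (cols) := lapply(.SD, to_int64), .SDcols = cols]

is.na(DT$V8)
```

The empty-string check is done explicitly before the conversion, so the result does not depend on how as.integer64() itself treats an empty string.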

This was done on a 64-bit machine running Windows 8.1 with Revolution R 3.0.2, and also on a virtual machine running Kubuntu 13.10 with CRAN R 3.0.2. Tested with the latest stable data.table from CRAN (1.8.10 dated February 7, 2014) and 1.8.11 (version 1110, 2014-02-04 02:43:19, manually installed from zip because the R-Forge build is broken) on Windows, and only the stable version 1.8.10 on Linux. bit64 is installed and loaded on both machines.

    > sessionInfo()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] grid      stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
     [1] bit64_0.9-3       bit_1.1-11        gdata_2.13.2      xts_0.9-7         zoo_1.7-10        nlme_3.1-113      hexbin_1.26.3     lattice_0.20-24   ggplot2_0.9.3.1
    [10] plyr_1.8          reshape2_1.2.2    data.table_1.8.11 Revobase_7.0.0    RevoMods_7.0.0    RevoScaleR_7.0.0

    loaded via a namespace (and not attached):
     [1] codetools_0.2-8    colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4       foreach_1.4.1      gtable_0.1.2       gtools_3.2.1       iterators_1.0.6
     [9] labeling_0.2       MASS_7.3-29        munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5 reshape_0.8.4      scales_0.2.3       stringr_0.6.2
    [17] tools_3.0.2
r data.table
2 answers

This appears to be a problem with the bit64 package, not with fread() or data.table. From the bit64 documentation (http://cran.r-project.org/web/packages/bit64/bit64.pdf):

"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the underlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."
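
The quoted passage hints at where 9218868437227407266 comes from. As a minimal sketch (an illustration only, not taken from the bit64 manual): an integer64 vector is stored in the 8 bytes of a double, so if those bytes hold R's ordinary NA_real_ rather than bit64's own missing value, reading them as a 64-bit integer should give exactly the number seen in the question. This assumes a platform where the NA_real_ bit pattern survives plain assignment unchanged, which is the usual case.

```r
library(bit64)

# Start from an ordinary double NA and relabel it as integer64
# without converting the value, so the 8 bytes stay the same.
x <- NA_real_
oldClass(x) <- "integer64"

as.character(x)       # "9218868437227407266", the number from the question
is.na(x)              # FALSE: bit64 does not treat this bit pattern as missing
is.na(unclass(x))     # TRUE: viewed as a plain double it is still NA

# bit64's own missing value uses a different bit pattern.
is.na(NA_integer64_)  # TRUE
```

This would also explain why str(), which looks at the underlying doubles, reports NA while is.na() on the integer64 column returns FALSE.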

I tried reassigning the value 9218868437227407266 to NA, thinking it would work, but it did not.

For example:

    DT[V8==9218868437227407266, ] # actually returns nothing, but
    DT[V8==max(V8), ]             # returns the rows with 9218868437227407266 in V8

    # but this does not reassign the value
    DT[V8==max(V8), V8:=NA]

    # not that this makes sense, but I tried just in case...
    DT[V8==max(V8), V8:=NA_character_]

The documentation seems to state fairly clearly that a vector of class integer64 will not recognize NA or missing values, so I am going to avoid bit64 rather than deal with this ...
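
For anyone stuck on an affected version, one possible way to repair such a column is to look for the NA_real_ bit pattern in the underlying doubles and overwrite those positions with bit64's own NA_integer64_. This is a sketch on a hand-built vector, not something from the data.table or bit64 documentation; the vector x below merely imitates what the affected fread() call appears to produce, and the approach has not been checked against the old versions themselves.

```r
library(bit64)

# Imitate the affected column: one slot holding the NA_real_ bit pattern
# and one genuine integer64 value.
good <- as.integer64("3127133583")
x <- c(NA_real_, unclass(good))
oldClass(x) <- "integer64"

# Positions whose underlying double is NA are the mis-encoded missing values.
bad <- is.na(unclass(x))

# Overwrite them with bit64's real missing value.
x[bad] <- NA_integer64_

is.na(x)
```

Testing the underlying doubles avoids comparing against the literal 9218868437227407266, which cannot be represented exactly as an ordinary R number.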


This bug, #488, is now fixed with this commit in the development version of data.table, v1.9.5, and the values are assigned (and displayed) correctly as NA if bit64 is loaded.

    require(data.table) # v1.9.5
    require(bit64)
    ans = fread("test.csv")
    #      V1  V2 V3 V4 V5  V6 V7         V8         V9        V10 V11
    # 1: 2012 276 NA  0 S1 001  1         NA  724135215 1590915056  NA
    # 2: 2012 276  2  8 S1 001  1         NA         NA     154598   0
    # 3: 2012 276  2 12 S1 001  1         NA    5118863   21819477  NA
    # 4: 2012 276  2  0 S1 011  8 3127133583 3127133583 9003982501   0
