Working around an apparent end of file (EOF) in data.table::fread

I am loading several large tab-delimited text files, exported from a database I have access to, into R with data.table::fread. fread handles most of the files with great ease and speed, but one of the files consistently produces an fread error:

    Error in fread(read_problem, encoding = "UTF-8", na.strings = "", header = TRUE, :
      Expected sep (' ') but new line or EOF ends field ...

A smaller (2000-line) version of the file that contains the offending line (RDS file) is available here.

Here is how I have tried to diagnose the problem so far:

    library(data.table)  # I'm using 1.9.7 development (same error with 1.9.6)
    read_problem <- readRDS("read_problem.rds")
    error <- fread(read_problem, encoding = "UTF-8", na.strings = "",
                   header = TRUE, sep = "\t",
                   colClasses = rep("character", 44),  # For simplicity
                   verbose = TRUE)

If I cut out the offending line, the problem disappears:

    cat(read_problem, file = "temp")
    string_vec <- readLines("temp")
    clipped_vec <- string_vec[-1027]  # Get rid of problem line 1027
    restored <- paste(clipped_vec, collapse = "\n")
    noerror <- fread(restored, encoding = "UTF-8", na.strings = "",
                     header = TRUE, sep = "\t",
                     colClasses = rep("character", 44))  # For simplicity
    class(noerror)
    # [1] "data.table" "data.frame"
    dim(noerror)
    # [1] 1999   44

The error message seems clear enough: fread is looking for "\t" but finds something else in its place.

But a closer look at the offending line, compared with its neighbours, does not reveal anything obvious to me.

The number of tab characters is the same:

    sapply(gregexpr("\t", string_vec[1026:1028]), length)
    # [1] 43 43 43

The line-break information looks identical:

    unlist(gregexpr("\n", string_vec[1026:1028]))
    # [1] -1 -1 -1

And here is the offending line itself:

    string_vec[1027]
    # [1] "URN:CornellLabOfOrnithology:EBIRD:OBS132960387\t29816\tspecies\tNelson Sparrow\tAmmodramus nelsoni\t\t\t1\t\t\tUnited States\tUS\tGeorgia\tUS-GA\tGlynn\tUS-GA-127\tUS-GA_3181\t\t\tJekyll Island\tL140461\tH\t31.0464993\t-81.4113007\t1990-11-03\t13:15:00\t\"Jekyll Island and Causeway. Partly cloudy, mild, NE wind 8-15 mph. Note: Did very little birding in upland habitats as time available was rather brief.\" Data entered on behalf of Paul Sykes by Alison Huff ( arhuff@uga.edu ) on 12-15-11.\tListed on old Georgia Field Checklist as \"Sparrow, Sharp-tailed.\"\tobsr289931\tPaul\tSykes\tS9336358\teBird - Traveling Count\tEBIRD\t270\t8.047\t\t1\t1\t\t1\t0\t\t"

Any tips on getting around this problem without manually removing the damaged lines?

r error-handling dataframe data.table
2 answers

With this commit, this is now fixed in v1.9.7, the current development version. The next stable release should therefore be able to read the file correctly using quote = "".

    require(data.table)  # v1.9.7+
    fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n', quote = "")
    #                     V1                  V2 V3
    # 1: "abcd efgh." ijkl.  mnop "qrst uvwx." 45
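Applied to the file from the question, the original call should then work once quote = "" is added. The following is an untested sketch against the development version; it assumes read_problem.rds from the question and a data.table build that includes the fix:

    library(data.table)  # needs a version containing the fix (>= 1.9.7 development)
    read_problem <- readRDS("read_problem.rds")
    fixed <- fread(read_problem, encoding = "UTF-8", na.strings = "",
                   header = TRUE, sep = "\t", quote = "",
                   colClasses = rep("character", 44))
    dim(fixed)  # expected: one row more than the clipped `noerror` read above, 44 columns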

On line 1027 there is only one tab after "Sparrow, Sharp-tailed.", whereas on the other lines there are two tabs between that field and the start of the "obsr[0-9]+" field.

The total number of tabs still matches because line 1027 has a tab before "Listed on old Georgia Field Checklist" where the other lines have a space.

So row 1027 only gets 43 columns instead of 44. That seems to be the problem.
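A quick way to check this claim, assuming the string_vec from the question is still in the workspace (this snippet is mine, not part of the original answer), is to look at the characters immediately preceding the "obsr..." observer-id field on each line:

    # Inspect the two characters right before the observer-id field; per the
    # diagnosis above, lines 1026 and 1028 should show two tabs here, while
    # line 1027 should show text followed by a single tab.
    sapply(string_vec[1026:1028], function(x) {
      pos <- regexpr("obsr[0-9]+", x)  # where the observer-id field starts
      substr(x, pos - 2, pos - 1)      # the two preceding characters
    }, USE.NAMES = FALSE)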


Looking at it again, it seems that Listed on old Georgia Field Checklist as "Sparrow, Sharp-tailed." should be read as a separate column, but it is instead being read together with the previous column ...

Here is a smaller reproducible example:

    # note that there are only 2 instead of 3 columns
    fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
    #                                    V1 V2
    # 1: abcd efgh." ijkl.\tmnop "qrst uvwx. 45

    # add a header row and it returns the same error
    fread('a\tb\tc\n"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
    # Error in fread("a\tb\tc\n\"abcd efgh.\" ijkl.\tmnop \"qrst uvwx.\"\t45\n") :
    #   Expected sep (' ') but new line, EOF (or other non printing character)
    #   ends field 1 when detecting types ( first): "abcd efgh." ijkl. mnop
    #   "qrst uvwx." 45

Filed #1367.
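Until a released version includes the fix, one possible workaround (my own sketch, not something tested in the question or answer) is to strip the embedded double quotes before handing the text to fread, since the quotes are never used as field delimiters in this tab-separated export:

    # Hypothetical pre-fix workaround: remove the literal double quotes so fread
    # never switches into quoted-field mode. Note that this changes the data:
    # the quote characters are lost from the comment fields.
    cleaned <- gsub('"', "", read_problem, fixed = TRUE)
    workaround <- fread(cleaned, encoding = "UTF-8", na.strings = "",
                        header = TRUE, sep = "\t",
                        colClasses = rep("character", 44))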


One possible solution (a consolidated sketch follows the list):

  1. Read all of the CSVs into one list:

    df <- lapply(csv, function(x) read.csv(x, stringsAsFactors = FALSE))

Each list item represents one CSV.

  2. Convert the list into one big data frame:

    df2 <- ldply(df, data.frame)  # ldply() comes from the plyr package

  3. Delete the line containing the EOF using grep as usual:

    df3 <- df2[!grepl("eof", df2$V1), ]

where V1 is the name of the column where the EOF is located.
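Put together, the steps above might look like the following sketch. It assumes the plyr package for ldply(), that csv is a character vector of file paths, and it adds sep = "\t" because the files in the question are tab-delimited rather than comma-separated:

    library(plyr)  # provides ldply()

    # csv is assumed to hold the paths of the exported files, e.g.
    # csv <- list.files("exports", pattern = "\\.txt$", full.names = TRUE)

    df  <- lapply(csv, function(x)
      read.csv(x, sep = "\t", stringsAsFactors = FALSE))  # one data frame per file
    df2 <- ldply(df, data.frame)                          # stack into one data frame
    df3 <- df2[!grepl("eof", df2$V1), ]                   # drop rows containing "eof"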

