I am reading several large tab-delimited text files, exported from a database I have access to, into R using data.table::fread. fread handles most of the files quickly and without trouble, but one of them triggers a commonly reported fread error:
Error in fread(read_problem, encoding = "UTF-8", na.strings = "", header = TRUE, : Expected sep (' ') but new line or EOF ends field ...
A smaller (2000-line) version of the file, containing the offending line, is available here as an RDS file.
Here is how I have tried to diagnose the problem so far:
    library(data.table)  # I'm using 1.9.7 development (same error with 1.9.6)
    read_problem <- readRDS("read_problem.rds")
    error <- fread(read_problem, encoding = "UTF-8", na.strings = "",
                   header = TRUE, sep = "\t",
                   colClasses = rep("character", 44),  # For simplicity
                   verbose = TRUE)
If I remove the offending line, the problem disappears:
    cat(read_problem, file = "temp")
    string_vec <- readLines("temp")
    clipped_vec <- string_vec[-1027]  # Get rid of problem line 1027
    restored <- paste(clipped_vec, collapse = "\n")
    noerror <- fread(restored, encoding = "UTF-8", na.strings = "",
                     header = TRUE, sep = "\t",
                     colClasses = rep("character", 44))  # For simplicity
    class(noerror)
    [1] "data.table" "data.frame"
    dim(noerror)
    [1] 1999 44
The error message seems clear enough: fread expects the separator "\t" but finds a line break or end of file in its place.
However, on closer inspection I cannot see anything obviously different about the offending line compared with its neighbours.
The number of tab characters is the same:
    sapply(gregexpr("\t", string_vec[1026:1028]), length)
    [1] 43 43 43
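The same check can be extended to every line at once. One caveat: sapply(gregexpr(...), length) reports 1 for a line with no tabs at all, because gregexpr() returns -1 when nothing matches. Below is a minimal sketch on toy data; toy_lines and count_tabs are hypothetical names used only for illustration, not part of the original session.

```r
# Toy stand-in for string_vec (hypothetical data, not the real file)
toy_lines <- c("a\tb\tc", "d\te\tf", "no tabs here", "g\th")

# Count literal tab characters per line. regmatches() yields character(0)
# for a line without a match, so lengths() gives 0 rather than 1.
count_tabs <- function(x) {
  lengths(regmatches(x, gregexpr("\t", x, fixed = TRUE)))
}

tab_counts <- count_tabs(toy_lines)

# Indices of lines whose tab count differs from that of the first (header) line
odd_lines <- which(tab_counts != tab_counts[1])
```

Run against the real string_vec, an empty odd_lines would suggest that every line has the same number of separators and that the cause lies elsewhere.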
None of the three lines contains an embedded line break:
    unlist(gregexpr("\n", string_vec[1026:1028]))
    [1] -1 -1 -1
Here is the offending line itself, printed as a string:
    string_vec[1027]
    [1] "URN:CornellLabOfOrnithology:EBIRD:OBS132960387\t29816\tspecies\tNelson Sparrow\tAmmodramus nelsoni\t\t\t1\t\t\tUnited States\tUS\tGeorgia\tUS-GA\tGlynn\tUS-GA-127\tUS-GA_3181\t\t\tJekyll Island\tL140461\tH\t31.0464993\t-81.4113007\t1990-11-03\t13:15:00\t\"Jekyll Island and Causeway. Partly cloudy, mild, NE wind 8-15 mph. Note: Did very little birding in upland habitats as time available was rather brief.\" Data entered on behalf of Paul Sykes by Alison Huff ( arhuff@uga.edu ) on 12-15-11.\tListed on old Georgia Field Checklist as \"Sparrow, Sharp-tailed.\"\tobsr289931\tPaul\tSykes\tS9336358\teBird - Traveling Count\tEBIRD\t270\t8.047\t\t1\t1\t\t1\t0\t\t"
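The string above contains escaped double quotes (\") inside some fields. Whether they are what trips fread is only an assumption, but it can be checked in the same spirit as the tab and line-break counts. The sketch below uses a made-up line; toy_line, fields, n_quotes and suspect are hypothetical names for illustration.

```r
# Toy line (hypothetical data): the third field opens with a double quote
# that closes before the field ends; the fifth field is fully quoted.
toy_line <- "id1\t42\t\"Partly cloudy.\" Entered later.\tplain\t\"fully quoted\""

# Split the line into fields on literal tabs
fields <- strsplit(toy_line, "\t", fixed = TRUE)[[1]]

# Number of double-quote characters in each field
n_quotes <- lengths(regmatches(fields, gregexpr("\"", fields, fixed = TRUE)))

# Fields that start with a double quote but do not end with one
suspect <- which(grepl("^\"", fields) & !grepl("\"$", fields))
```

Applying the same steps to string_vec[1027] and to a neighbouring line that parses cleanly would show whether quote placement is something that sets the offending line apart.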
Any tips on how to work around this problem without manually removing the offending lines?