Import fixed width data file without line separator

I have fixed width files (.dbf) that don't have line separators. Here is what the two lines of this data file look like:

20141101 77h 3.210 0 3 20141102 76h 3.090 0 3 

The width of one row is c(8,4,7,41): the date (8), some temporary measure (4), the data point (7), and some other columns that can be lumped together into one "rest" column (41). There is no separator at the end of a row; the next row simply follows the previous one, so all time steps are written sequentially in one massive line. The file contains only numbers, characters, and spaces.

With read.fwf('filepath', widths = c(8,4,7,41)), R stops reading after the first line because there is no line separator.

Is there an argument to tell read.fwf() when to start reading a new line when there is no line separator? Or should I use a different read command?

Thanks in advance.

3 answers

Another, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):

    txt <- readLines('filepath')
    dfx <- data.frame(V1 = sapply(seq(from = 1, to = nchar(txt), by = 60),
                                  function(x) substr(txt, x, x + 59)))
    library(dplyr)
    library(tidyr)
    dfx %>%
      separate(V1, paste0("V", LETTERS[1:5]), c(8, 12, 19, 55)) %>%
      mutate_all(trimws)

which gives:

            VA  VB    VC VD VE
    1 20141101 77h 3.210  0  3
    2 20141102 76h 3.090  0  3

To get different column names, simply replace paste0("V", LETTERS[1:5]) with a vector of the column names you need.

If you want to convert the columns to their correct classes instead of character, you can use funs(type.convert(trimws(.))) inside mutate_all.
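As a minimal base-R illustration (with made-up values, no dplyr required) of what type.convert(trimws(.)) does per column: numeric-looking strings become numeric, everything else stays character.

```r
# Hypothetical padded character columns, as produced by a fixed-width split
v_num  <- c("  3.210", "  3.090")   # numeric after trimming
v_char <- c(" 77h", " 76h")         # stays character

y <- type.convert(trimws(v_num), as.is = TRUE)   # numeric vector
z <- type.convert(trimws(v_char), as.is = TRUE)  # character vector
```

as.is = TRUE keeps non-numeric columns as character rather than converting them to factors.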


Maybe not a good idea, but this should work:

    # Warning: choose a sep not appearing in the data so the whole file is read as one record.
    content <- scan('filepath', 'character', sep = '~')
    # Split content into 60-character lines:
    lines <- regmatches(content, gregexpr('.{60}', content))[[1]]
    x <- tempfile()
    write(lines, x)
    data <- read.fwf(x, widths = c(8, 4, 7, 41))
    unlink(x)

The idea is to read the entire file as one record, split it into 60-character chunks, write them to a temporary file, read the data from that temporary file with read.fwf(), and then delete the temporary file.
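The 60-character split itself can be checked on a small synthetic string (the values here are invented; sprintf pads each field to the widths from the question):

```r
# Build two padded 60-byte records in memory instead of reading a file
rec <- function(date, tmp, val, rest) sprintf("%-8s%-4s%-7s%-41s", date, tmp, val, rest)
content <- paste0(rec("20141101", "77h", "3.210", "0 3"),
                  rec("20141102", "76h", "3.090", "0 3"))

# Same splitting idea as above: one match per 60-character record
lines <- regmatches(content, gregexpr(".{60}", content))[[1]]
length(lines)           # 2
substr(lines[2], 1, 8)  # "20141102"
```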

Another approach uses regular expressions and the stringr package (still working on the content obtained from the scan above):

    library(stringr)
    d <- data.frame(
      str_match_all(content, "(.{8})(.{4})(.{7})(.{41})")[[1]][, 2:5],
      stringsAsFactors = FALSE)

which gives:

            V1  V2    V3   V4
    1 20141101 77h 3.210  0 3
    2 20141102 76h 3.090  0 3

str_match_all returns a list, here with one element because there is only one input line, so we extract it with [[1]].

The match matrix has 5 columns, the first holding the full match and the others the capture groups, so we subset columns 2 through 5 to keep only the 4 columns we need, and wrap the result in data.frame() to get a data.frame at the end.

You can then name the columns with colnames(d) <- c('date','time','data_point','rest').

If you want to strip the whitespace, you can wrap the result of str_match_all in trimws (thanks @jaap for reminding me of this function), as follows:

    td <- data.frame(
      trimws(str_match_all(content, "(.{8})(.{4})(.{7})(.{41})")[[1]][, 2:5]),
      stringsAsFactors = FALSE)

Output:

            X1  X2    X3  X4
    1 20141101 77h 3.210 0 3
    2 20141102 76h 3.090 0 3

In addition to the other answers, some general information about dbf files:

If this is not a one-off read of a static file, it is best to check the file/field structure first, since it can change over time. See here for the internal structure of a dbf file.

But perhaps even more important:

Each record in a dbf file is preceded by one byte for the delete flag. If it is a space, the record is not deleted; if it is an asterisk *, the record is marked for deletion (records are not removed from the dbf file until the file is packed), and you probably want to skip those records. The first bytes of the record data may also be overwritten, for example with "DELETED".

So, in your record layout c(8,4,7,41), the last byte of the "rest" column (41) is actually the delete flag of the following record, and the last record in the file will contain only 40 bytes for this field (though if you're lucky, the file ends with an EOF marker (0x1a), so you may not have run into a size problem there).

So your record layout should actually be c(1,8,4,7,40), where 1 is the delete flag, and reading should start one byte earlier.
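A sketch of how that flag could be used (synthetic records again; the sample values are invented): treat the first byte of each 60-byte record as the delete flag and drop records marked with *.

```r
# Each record: 1-byte delete flag + the four data fields (widths 8, 4, 7, 40)
rec <- function(flag, date, tmp, val, rest)
  sprintf("%s%-8s%-4s%-7s%-40s", flag, date, tmp, val, rest)
content <- paste0(rec(" ", "20141101", "77h", "3.210", "0 3"),
                  rec("*", "20141102", "76h", "3.090", "0 3"))  # marked deleted

records <- regmatches(content, gregexpr(".{60}", content))[[1]]
kept <- records[substr(records, 1, 1) != "*"]  # skip deleted records
length(kept)  # 1
```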

