A faster way to read fixed-width files in R

I work with a lot of fixed-width files (i.e., no delimiting character) that I need to read into R, so there is usually a definition of column widths to parse the string into variables. I can use read.fwf to read the data without a problem. However, for large files this can take a long time. For a recent dataset, it took 800 seconds to read in a dataset with ~500,000 rows and 143 variables.

 seer9 <- read.fwf("~/data/rawdata.txt", widths = cols, header = FALSE,
                   buffersize = 250000, colClasses = "character",
                   stringsAsFactors = FALSE)

fread in the data.table package in R is great for solving most data reading problems, except that it does not parse fixed-width files. However, I can read each line as a single character string (~ 500,000 lines, 1 column). It takes 3-5 seconds. (I like data.table.)

 seer9 <- fread("~/data/rawdata.txt", colClasses = "character", sep = "\n", header = FALSE, verbose = TRUE) 

There are many good posts on SO about how to parse fixed-width files. See JHoward's suggestion here to create a matrix of start and end columns and use substr to parse the data. See GSee's suggestion here to use strsplit. I could not figure out how to make that work with this data. (Also, Michael Smith made some suggestions on the data.table mailing list involving sed that were beyond my ability to implement.) Now, using fread and substr(), I can do the whole thing in about 25-30 seconds. Note that coercing to a data.table at the end takes a chunk of time (5 seconds?).

 end_col <- cumsum(cols)
 start_col <- end_col - cols + 1
 start_end <- cbind(start_col, end_col)  # matrix of start and end positions
 text <- lapply(seer9, function(x) {
   apply(start_end, 1, function(y) substr(x, y[1], y[2]))
 })
 dt <- data.table(text$V1)
 setnames(dt, old = 1:ncol(dt), new = seervars)

I am wondering whether this can be improved any further. I know I'm not the only one who has to read fixed-width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using parallel with mclapply and data.table instead of lapply, but that didn't change anything. (Probably due to my inexperience in R.) I imagine an Rcpp function could be written to do this really fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.
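
For reference, one way to parallelize the substr step with mclapply would look roughly like the sketch below (hypothetical code, not necessarily exactly what I tried; mclapply forks, so it is Unix-only):

 library(parallel)
 library(data.table)

 # parse one field per task, in parallel across cores
 text_list <- mclapply(seq_len(nrow(start_end)), function(i) {
   substr(seer9$V1, start_end[i, 1], start_end[i, 2])
 }, mc.cores = max(1L, detectCores() - 1L))

 dt <- setDT(text_list)   # list of equal-length character vectors -> data.table
 setnames(dt, seervars)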

My data.table implementation (with a magrittr chain) takes the same amount of time:

 text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% data.table(.) 

Can anyone make suggestions to improve the speed of this? Or is it about as good as it gets?

Here is code to create a sample data.table within R (rather than linking to the actual data). It should have 331 characters and 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space-delimited data. (I am reading raw SEER data, in case anyone is interested.) I am also including the column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for the SEER data.

 seer9 <- data.table(rep((paste0(paste0(letters, 1000:1054, " ", collapse = ""), " ")), 500000))
 cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
 seervars <- c("CASENUM", "REG", "MAR_STAT", "RACE", "ORIGIN", "NHIA", "SEX", "AGE_DX", "YR_BRTH", "PLC_BRTH", "SEQ_NUM", "DATE_mo", "DATE_yr", "SITEO2V", "LATERAL", "HISTO2V", "BEHO2V", "HISTO3V", "BEHO3V", "GRADE", "DX_CONF", "REPT_SRC", "EOD10_SZ", "EOD10_EX", "EOD10_PE", "EOD10_ND", "EOD10_PN", "EOD10_NE", "EOD13", "EOD2", "EOD4", "EODCODE", "TUMOR_1V", "TUMOR_2V", "TUMOR_3V", "CS_SIZE", "CS_EXT", "CS_NODE", "CS_METS", "CS_SSF1", "CS_SSF2", "CS_SSF3", "CS_SSF4", "CS_SSF5", "CS_SSF6", "CS_SSF25", "D_AJCC_T", "D_AJCC_N", "D_AJCC_M", "D_AJCC_S", "D_SSG77", "D_SSG00", "D_AJCC_F", "D_SSG77F", "D_SSG00F", "CSV_ORG", "CSV_DER", "CSV_CUR", "SURGPRIM", "SCOPE", "SURGOTH", "SURGNODE", "RECONST", "NO_SURG", "RADIATN", "RAD_BRN", "RAD_SURG", "SS_SURG", "SRPRIM02", "SCOPE02", "SRGOTH02", "REC_NO", "O_SITAGE", "O_SEQCON", "O_SEQLAT", "O_SURCON", "O_SITTYP", "H_BENIGN", "O_RPTSRC", "O_DFSITE", "O_LEUKDX", "O_SITBEH", "O_EODDT", "O_SITEOD", "O_SITMOR", "TYPEFUP", "AGE_REC", "SITERWHO", "ICDOTO9V", "ICDOT10V", "ICCC3WHO", "ICCC3XWHO", "BEHANAL", "HISTREC", "BRAINREC", "CS0204SCHEMA", "RAC_RECA", "RAC_RECY", "NHIAREC", "HST_STGA", "AJCC_STG", "AJ_3SEER", "SSG77", "SSG2000", "NUMPRIMS", "FIRSTPRM", "STCOUNTY", "ICD_5DIG", "CODKM", "STAT_REC", "IHS", "HIST_SSG_2000", "AYA_RECODE", "LYMPHOMA_RECODE", "DTH_CLASS", "O_DTH_CLASS", "EXTEVAL", "NODEEVAL", "METSEVAL", "INTPRIM", "ERSTATUS", "PRSTATUS", "CSSCHEMA", "CS_SSF8", "CS_SSF10", "CS_SSF11", "CS_SSF13", "CS_SSF15", "CS_SSF16", "VASINV", "SRV_TIME_MON", "SRV_TIME_MON_FLAG", "SRV_TIME_MON_PA", "SRV_TIME_MON_FLAG_PA", "INSREC_PUB", "DAJCC7T", "DAJCC7N", "DAJCC7M", "DAJCC7STG", "ADJTM_6VALUE", "ADJNM_6VALUE", "ADJM_6VALUE", "ADJAJCCSTG")

UPDATE: LaF did the entire read in just 7 seconds from the raw .txt file. There may be an even faster way, but I doubt anything could do much better. Awesome package.

July 27, 2015 update: Just wanted to provide a small update. Using the new readr package, I was able to read the entire file in 5 seconds with readr::read_fwf.

 seer9_readr <- read_fwf("path_to_data/COLRECT.TXT", col_positions = fwf_widths(cols)) 

In addition, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above that uses fread to read the file (about 4 seconds), followed by apply to parse each line, extracting the 143 variables took about 8 seconds with stringi::stri_sub compared to 19 with base::substr. So fread plus stri_sub is still only about 12 seconds. Not bad.

 seer9 <- fread("path_to_data/COLRECT.TXT", colClasses = "character", sep = "\n", header = FALSE)
 text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% data.table(.)
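
For comparison, the stri_sub version is the same chain with substr swapped out (a sketch, assuming the same start_end matrix as above):

 library(stringi)
 # same parse as above, with stringi::stri_sub in place of base::substr
 text <- seer9[ , apply(start_end, 1, function(y) stri_sub(V1, y[1], y[2]))] %>% data.table(.)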

December 10, 2015 update:

Also see the answer below from @MichaelChirico, who added some great benchmarks and the iotools package.

+32
substring r data.table apply lapply
Jul 12 '14 at 18:09
4 answers

You can use the LaF package, which was written to handle large fixed-width files (including files too large to fit into memory). To use it, you first need to open the file using laf_open_fwf. You can then index the resulting object as you would a normal data frame to read the data you need. In the example below I read the entire file, but you can also read specific columns and/or rows:

 library(LaF)
 laf <- laf_open_fwf("foo.dat", column_widths = cols,
                     column_types = rep("character", length(cols)),
                     column_names = seervars)
 seer9 <- laf[,]
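
To read only part of the file, index the laf object the same way; for example (a small sketch reusing the object opened above):

 # read only the first three columns of the first 1,000 rows
 subset1 <- laf[1:1000, 1:3]

 # or pull a single column for all rows
 casenum <- laf[ , 1]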

Your example using 5,000 lines (instead of 500,000) took 28 seconds using read.fwf and 1.6 seconds using LaF .

Addendum: Your example using 50,000 lines (instead of 500,000) took 258 seconds using read.fwf and 7 seconds using LaF on my machine.

+27
Jul 12 '14 at 19:57

Now that there are (between this and the other major question about efficiently reading fixed-width files) a fair number of options on offer for reading such files, I think some benchmarking is appropriate.

For comparison, I'll use the following file, which is on the large side (400 MB). It's just a bunch of random characters with randomly defined fields and widths:

 set.seed(21394)
 wwidth = 400L
 rrows = 1000000

 # creating the contents at random
 contents = write.table(
   replicate(rrows, paste0(sample(letters, wwidth, replace = TRUE), collapse = "")),
   file = "testfwf.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)

 # defining the fields & writing a dictionary
 n_fields = 40L
 endpoints = unique(c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L))
 cols = list(beg = endpoints[-(n_fields + 1L)],
             end = endpoints[-1L] - 1L)
 dict = data.frame(column = paste0("V", seq_len(length(endpoints) - 1L)),
                   start = endpoints[-length(endpoints)] - 1,
                   length = diff(endpoints))
 write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare the five methods mentioned between these two threads (I'll add some others if people would like): the base version (read.fwf), piping the result of in2csv into fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the LaF/ffbase approach (based on the LaF answer above), and an improved (streamlined) version of the one proposed by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the comparison code:

 library(data.table)
 library(stringi)
 library(readr)
 library(LaF); library(ffbase)
 library(microbenchmark)

 microbenchmark(times = 5L,
   utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
   in2csv = fread(paste("in2csv -f fixed -s", "~/Desktop/testdic.csv", "~/Desktop/testfwf.txt")),
   readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
   LaF = {
     my.data.laf = laf_open_fwf('testfwf.txt', column_widths = diff(endpoints),
                                column_types = rep("character", length(endpoints) - 1L))
     my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
     as.data.frame(my.data)
   },
   fread = fread("testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) stri_sub(V1, cols$beg[ii], cols$end[ii]))])

And the output:

 # Unit: seconds
 #    expr       min        lq      mean    median        uq       max neval cld
 #   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
 #  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5  a
 #   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5  a
 #     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5  b
 #   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5  a

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr is that you can pre-specify the column types; with fread you have to convert the types afterwards.
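
As an illustration (a sketch reusing the objects from the benchmark above, with hypothetical type choices):

 # readr: declare the types up front (here, everything as character)
 dat_readr <- read_fwf("testfwf.txt", fwf_widths(diff(endpoints)),
                       col_types = paste(rep("c", length(cols$beg)), collapse = ""))

 # fread: read and parse first, then convert selected columns afterwards
 dat_dt <- fread("testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) stri_sub(V1, cols$beg[ii], cols$end[ii]))]
 num_cols <- c("V1", "V2")   # hypothetical columns to convert (the test data here is all letters)
 dat_dt[ , (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]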

EDIT: adding some alternatives

At @AnandaMahto's suggestion, I am including a few more options, including one that appears to be a new winner! To save time, I excluded the slowest options above from the new comparison. Here's the new code:

 library(iotools)

 microbenchmark(times = 5L,
   readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
   fread = fread("testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) stri_sub(V1, cols$beg[ii], cols$end[ii]))],
   iotools = input.file("testfwf.txt", formatter = dstrfw,
                        col_types = rep("character", length(endpoints) - 1L),
                        widths = diff(endpoints)),
   awk = fread(paste(
     "awk -v FIELDWIDTHS='", paste(diff(endpoints), collapse = " "),
     "' -v OFS=', ' '{$1=$1 \"\"; print}' < ~/Desktop/testfwf.txt",
     collapse = " "), header = FALSE))

And the new output:

 # Unit: seconds
 #     expr       min        lq      mean    median        uq       max neval cld
 #    readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5  a
 #    fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5  a
 #  iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5  a
 #      awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5  b

So it looks like iotools is both very fast and very consistent.
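
Outside of the benchmark, a plain iotools read that reuses the dictionary written earlier might look like this (a sketch; the column names come from the dict object above):

 library(iotools)

 # read the fixed-width file in one call and attach the dictionary's column names
 fwf_df <- input.file("testfwf.txt", formatter = dstrfw,
                      col_types = rep("character", nrow(dict)),
                      widths = diff(endpoints))
 colnames(fwf_df) <- dict$column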

+20
Dec 09 '15 at 22:23

I wrote a parser for this kind of thing yesterday, but it requires a very specific kind of input for the header file, so I will show you how to format your column widths to be able to use it.

Convert your flat file to csv

First, download the tool in question.

You can download the binary from the bin directory if you are on OS X Mavericks (where I compiled it), or compile it by going to src and using clang++ csv_iterator.cpp parse.cpp main.cpp -o flatfileparser.

The flat file parser needs two files: a CSV header file, in which every fifth element specifies the variable width (again, this is due to my extremely specific application), which you can generate with:

 cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
 writeLines(sapply(c(-1, cols), function(x) paste0(',,,,', x)), '~/tmp/header.csv')

and copy the resulting ~/tmp/header.csv to the same directory as your flatfileparser. Move the flat file to the same directory as well, and you can run it on your flat file:

 ./flatfileparser header.csv yourflatfile 

which will produce yourflatfile.csv. Add the header you have above manually, using piping (>> from Bash).

Read in your CSV file quickly

Use Hadley's experimental fastread package by passing the file name to fastread::read_csv, which yields a data.frame. I do not believe it supports fwf files yet, although that is on the way.

+3
Jul 12 '14 at 18:26

I'm not sure which OS you are using, but this worked pretty straightforwardly for me on Linux:

Step 1: Create a command for awk to convert the file to a csv

You can save it as an actual CSV file if you plan to use the data in other software as well (a sketch of that follows Step 2 below).

 myCommand <- paste(
   "awk -v FIELDWIDTHS='",
   paste(cols, collapse = " "),
   "' -v OFS=',' '{$1=$1 \"\"; print}' < ~/rawdata.txt",
   collapse = " ")

Step 2: Use fread directly on the command you just created.

 seer9 <- fread(myCommand) 
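
If you would rather keep the converted CSV on disk (as mentioned in Step 1) than pipe it straight into fread, a sketch along these lines should work (assuming the same cols vector and input path as above, writing to a hypothetical ~/rawdata.csv):

 # hypothetical: redirect the awk output to a file, then read that file
 system(paste(
   "awk -v FIELDWIDTHS='", paste(cols, collapse = " "),
   "' -v OFS=',' '{$1=$1 \"\"; print}' < ~/rawdata.txt > ~/rawdata.csv",
   collapse = " "))
 seer9 <- fread("~/rawdata.csv")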



I did not time it because I obviously use a slower system than you and Jan :-)

+2
Jul 13 '14 at 19:59


