I work with a lot of fixed-width files (i.e. no delimiting character) that I need to read into R. So, there is usually a definition of column widths to parse the string into variables. I can use read.fwf to read the data in without a problem. However, for large files this can take a long time. For a recent dataset, it took 800 seconds to read in a dataset with ~500,000 rows and 143 variables.
seer9 <- read.fwf("~/data/rawdata.txt", widths = cols, header = FALSE, buffersize = 250000, colClasses = "character", stringsAsFactors = FALSE)
fread from the data.table package in R is great for solving most data-reading problems, except that it does not parse fixed-width files. However, I can read each line in as a single character string (~500,000 rows, 1 column). That takes 3-5 seconds. (I like data.table.)
seer9 <- fread("~/data/rawdata.txt", colClasses = "character", sep = "\n", header = FALSE, verbose = TRUE)
There are a number of good SO posts on how to parse text files. See JHoward's suggestion here to create a matrix of start and end columns and use substr to parse the data. See GSee's suggestion here to use strsplit. I couldn't figure out how to make that work with this data. (Also, Michael Smith made some suggestions on the data.table mailing list involving sed that were beyond my ability to implement.) Now, using fread and substr(), I can do the whole thing in about 25-30 seconds. Note that coercing to a data.table at the end takes a chunk of time (5 seconds?).
end_col <- cumsum(cols)
start_col <- end_col - cols + 1
start_end <- cbind(start_col, end_col)  # matrix of start/end positions for each variable
I am wondering whether this can be improved any further? I know I am not the only one who has to read fixed-width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using parallel with mclapply and data.table instead of lapply, but that didn't change anything. (Likely due to my inexperience in R.) I imagine that an Rcpp function could be written to do this super fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.
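For what it's worth, the parallel attempt looked roughly like this (a sketch rather than my exact code; the number of cores is arbitrary):

library(parallel)
# parse one column per list element, splitting the work across cores
text_list <- mclapply(seq_len(nrow(start_end)),
                      function(i) substr(seer9$V1, start_end[i, 1], start_end[i, 2]),
                      mc.cores = 4)
text <- as.data.table(text_list)  # columns come back named V1, V2, ...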
My data.table implementation (with a magrittr chain) takes the same amount of time:
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% data.table(.)
Can anyone make suggestions to improve the speed of this? Or is it about as good as it gets?
Here is code to create a similar data.table within R (rather than linking to the actual data). It should have 331 characters per row and 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space-delimited data. (I am reading raw SEER data, in case anyone is interested.) Also including the column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for the SEER data.
seer9 <- data.table(rep((paste0(paste0(letters, 1000:1054, " ", collapse = ""), " ")), 500000))
cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
seervars <- c("CASENUM", "REG", "MAR_STAT", "RACE", "ORIGIN", "NHIA", "SEX", "AGE_DX", "YR_BRTH", "PLC_BRTH", "SEQ_NUM", "DATE_mo", "DATE_yr", "SITEO2V", "LATERAL", "HISTO2V", "BEHO2V", "HISTO3V", "BEHO3V", "GRADE", "DX_CONF", "REPT_SRC", "EOD10_SZ", "EOD10_EX", "EOD10_PE", "EOD10_ND", "EOD10_PN", "EOD10_NE", "EOD13", "EOD2", "EOD4", "EODCODE", "TUMOR_1V", "TUMOR_2V", "TUMOR_3V", "CS_SIZE", "CS_EXT", "CS_NODE", "CS_METS", "CS_SSF1", "CS_SSF2", "CS_SSF3", "CS_SSF4", "CS_SSF5", "CS_SSF6", "CS_SSF25", "D_AJCC_T", "D_AJCC_N", "D_AJCC_M", "D_AJCC_S", "D_SSG77", "D_SSG00", "D_AJCC_F", "D_SSG77F", "D_SSG00F", "CSV_ORG", "CSV_DER", "CSV_CUR", "SURGPRIM", "SCOPE", "SURGOTH", "SURGNODE", "RECONST", "NO_SURG", "RADIATN", "RAD_BRN", "RAD_SURG", "SS_SURG", "SRPRIM02", "SCOPE02", "SRGOTH02", "REC_NO", "O_SITAGE", "O_SEQCON", "O_SEQLAT", "O_SURCON", "O_SITTYP", "H_BENIGN", "O_RPTSRC", "O_DFSITE", "O_LEUKDX", "O_SITBEH", "O_EODDT", "O_SITEOD", "O_SITMOR", "TYPEFUP", "AGE_REC", "SITERWHO", "ICDOTO9V", "ICDOT10V", "ICCC3WHO", "ICCC3XWHO", "BEHANAL", "HISTREC", "BRAINREC", "CS0204SCHEMA", "RAC_RECA", "RAC_RECY", "NHIAREC", "HST_STGA", "AJCC_STG", "AJ_3SEER", "SSG77", "SSG2000", "NUMPRIMS", "FIRSTPRM", "STCOUNTY", "ICD_5DIG", "CODKM", "STAT_REC", "IHS", "HIST_SSG_2000", "AYA_RECODE", "LYMPHOMA_RECODE", "DTH_CLASS", "O_DTH_CLASS", "EXTEVAL", "NODEEVAL", "METSEVAL", "INTPRIM", "ERSTATUS", "PRSTATUS", "CSSCHEMA", "CS_SSF8", "CS_SSF10", "CS_SSF11", "CS_SSF13", "CS_SSF15", "CS_SSF16", "VASINV", "SRV_TIME_MON", "SRV_TIME_MON_FLAG", "SRV_TIME_MON_PA", "SRV_TIME_MON_FLAG_PA", "INSREC_PUB", "DAJCC7T", "DAJCC7N", "DAJCC7M", "DAJCC7STG", "ADJTM_6VALUE", "ADJNM_6VALUE", "ADJM_6VALUE", "ADJAJCCSTG")
UPDATE: LaF did the entire read in just 7 seconds from the raw .txt file. Perhaps there is an even faster way, but I doubt anything could do appreciably better. Amazing package.
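For reference, the LaF call was along these lines (a sketch, reusing cols and seervars from above):

library(LaF)
# laf_open_fwf opens the file lazily; the data are parsed when indexed
laf <- laf_open_fwf("~/data/rawdata.txt",
                    column_types = rep("character", length(cols)),
                    column_widths = cols,
                    column_names = seervars)
seer9_laf <- laf[ , ]   # pull all rows and columns into a data.frame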
July 27, 2015 update: Just wanted to provide a small update. I used the new readr package and was able to read the entire file in 5 seconds using readr::read_fwf.
seer9_readr <- read_fwf("path_to_data/COLRECT.TXT", col_positions = fwf_widths(cols))
In addition, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above, which uses fread to read the file (about 4 seconds) and then parses each line, extracting the 143 variables took about 8 seconds with stringi::stri_sub compared to 19 with base::substr. So fread plus stri_sub is still only about 12 seconds. Not bad.
seer9 <- fread("path_to_data/COLRECT.TXT", colClasses = "character", sep = "\n", header = FALSE)
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% data.table(.)
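The stri_sub version just swaps out the substring call (a sketch; everything else stays the same):

library(stringi)
text <- seer9[ , apply(start_end, 1, function(y) stri_sub(V1, y[1], y[2]))] %>% data.table(.)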
December 10, 2015 update:
Please see the answer below from @MichaelChirico, who added some great benchmarks and the iotools package.
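In case it helps anyone skimming, the iotools approach looks roughly like this (a sketch, assuming cols as defined above; dstrfw parses fixed-width records directly from the raw input):

library(iotools)
# read and parse the fixed-width file in one pass
seer9_io <- input.file("path_to_data/COLRECT.TXT",
                       formatter = dstrfw,
                       col_types = rep("character", length(cols)),
                       widths = cols)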