Read Big Width Fixed Data

How can I read big fixed-width data? I read this question and tried some of the tips, but all the answers deal with delimited data (like .csv), which is not my case. The file is 558 MB, and I do not know how many rows it has.

I use:

 dados <- read.fwf('TS_MATRICULA_RS.txt',
                   width = c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9,
                             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                             1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9,
                             1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
                   stringsAsFactors = FALSE, comment.char = '',
                   colClasses = c('integer', 'integer', 'integer', 'integer', 'integer',
                                  'integer', 'integer', 'integer', 'integer', 'integer',
                                  'character', 'character', 'character', 'integer', 'integer',
                                  'character', 'integer', 'integer', 'character', 'integer',
                                  'character', 'character', 'character', 'character', 'character',
                                  'character', 'character', 'character', 'character', 'character',
                                  'character', 'character', 'character', 'character', 'character',
                                  'character', 'character', 'character', 'character', 'character',
                                  'character', 'character', 'character', 'character', 'character',
                                  'character', 'character', 'character', 'character', 'integer',
                                  'integer', 'integer', 'integer', 'integer', 'integer',
                                  'integer', 'integer', 'character', 'integer', 'integer',
                                  'character', 'character', 'character', 'character', 'integer',
                                  'character', 'character', 'character', 'character', 'character',
                                  'character', 'character', 'character'),
                   buffersize = 180000)

But reading the data takes 30 minutes (and counting ...). Any new suggestions?

r bigdata
Sep 10 '13 at 13:18
3 answers

It is difficult to give a concrete answer without details about your data, but here are a few ideas to help you get started:

First, if you are on a Unix system, you can get some information about your file with the wc command. For example, wc -l TS_MATRICULA_RS.txt tells you how many lines your file has, and wc -L TS_MATRICULA_RS.txt tells you the length of the longest line in your file. That can be useful to know. Similarly, head and tail let you inspect the first and last 10 lines of your text file.
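If you are not on a Unix system, a rough R equivalent is a sketch like the following: it reads the file in chunks, so memory use stays flat no matter how large the file is. The demo runs on a small temporary file; point the connection at TS_MATRICULA_RS.txt for the real run.

```r
# Demo on a small temporary file; substitute TS_MATRICULA_RS.txt
# for the real file. Chunked readLines() keeps memory use flat.
ff <- tempfile()
writeLines(c("1234", "12", "123456"), ff)

con <- file(ff, open = "r")
n_lines <- 0L
max_width <- 0L
repeat {
  chunk <- readLines(con, n = 100000)   # up to 100k lines per chunk
  if (length(chunk) == 0L) break
  n_lines <- n_lines + length(chunk)
  max_width <- max(max_width, nchar(chunk))
}
close(con)
unlink(ff)

n_lines    # 3
max_width  # 6
```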

Second, some suggestions: since you seem to know the width of each field, I would recommend one of two approaches.

Option 1: csvkit + your favorite method for quickly reading big data

csvkit is a Python toolkit for working with CSV files. One of its tools is in2csv, which takes a fixed-width file together with a "schema" file and produces a proper CSV that can be used with other programs.

The schema file itself is a CSV file with three columns: (1) the name of the variable, (2) the starting position, and (3) the width. Example (from the in2csv man page):

 column,start,length
 name,0,30
 birthday,30,10
 age,40,3

Once you have created this file, you can use something like:

 in2csv -f fixed -s path/to/schemafile.csv path/to/TS_MATRICULA_RS.txt > TS_MATRICULA_RS.csv 
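Since you already have the field widths in R, you can also generate the schema file programmatically instead of typing it by hand. A sketch (the column names V1, V2, ... are made up; only the first few widths from the question are shown):

```r
# Build the three-column in2csv schema (column,start,length)
# from a widths vector. in2csv uses 0-based start positions,
# hence cumsum(widths) - widths.
widths <- c(5, 13, 14, 3)                  # first few widths from the question
schema <- data.frame(column = paste0("V", seq_along(widths)),  # made-up names
                     start  = cumsum(widths) - widths,
                     length = widths)
write.csv(schema, "mySchemaFile.csv", row.names = FALSE, quote = FALSE)
schema$start   # 0 5 18 32
```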

From there, I would suggest reading the data in with fread from data.table, or using sqldf.
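For example, reading the converted file back in might look like this (a sketch, assuming the in2csv step above wrote TS_MATRICULA_RS.csv):

```r
library(data.table)

# fread() auto-detects the separator and column types, and is
# typically far faster than read.csv on a ~500 MB file.
dados <- fread("TS_MATRICULA_RS.csv")
```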

Option 2: sqldf using substr

Using sqldf on a file as large as yours should be pretty fast, and you get the ability to specify exactly what you want to read by using substr.

Again, this will mean that you have access to the schema file as described above. Once you have the schema file, you can do the following:

 temp <- read.csv("mySchemaFile.csv")

 ## Construct your "substr" command
 GetMe <- paste("select",
                paste("substr(V1, ", temp$start, ", ", temp$length, ") `",
                      temp$column, "`", sep = "", collapse = ", "),
                "from fixed", sep = " ")

 ## Load "sqldf"
 library(sqldf)

 ## Connect to your file
 fixed <- file("TS_MATRICULA_RS.txt")
 myDF <- sqldf(GetMe, file.format = list(sep = "_"))



Since you know the widths, you can even skip generating the schema file: from the widths, it is just a little work with cumsum. Here is a basic example building on the first example from read.fwf:

 ff <- tempfile()
 cat(file = ff, "123456", "987654", sep = "\n")
 read.fwf(ff, widths = c(1, 2, 3))

 widths <- c(1, 2, 3)
 length <- cumsum(widths)
 start <- length - widths + 1
 column <- paste("V", seq_along(length), sep = "")

 GetMe <- paste("select",
                paste("substr(V1, ", start, ", ", widths, ") `",
                      column, "`", sep = "", collapse = ", "),
                "from fixed", sep = " ")

 library(sqldf)

 ## Connect to your file
 fixed <- file(ff)
 myDF <- sqldf(GetMe, file.format = list(sep = "_"))
 myDF
 unlink(ff)
Sep 10 '13 at 18:11

The LaF package is very good at reading fixed-width files quickly. I use it daily to load files of roughly 100 million records with 30 columns (not as many character columns as you have; mostly numeric data and some factors), and it is pretty fast. So this is what I would do.

 library(LaF)
 library(ffbase)

 ## Note: laf_open_fwf does not take stringsAsFactors or comment.char;
 ## character columns that should become factors are declared 'categorical'.
 my.data.laf <- laf_open_fwf('TS_MATRICULA_RS.txt',
                             column_widths = c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9,
                                               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                               1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9,
                                               1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1),
                             column_types = c('integer', 'integer', 'integer', 'integer', 'integer',
                                              'integer', 'integer', 'integer', 'integer', 'integer',
                                              'categorical', 'categorical', 'categorical', 'integer', 'integer',
                                              'categorical', 'integer', 'integer', 'categorical', 'integer',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'integer',
                                              'integer', 'integer', 'integer', 'integer', 'integer',
                                              'integer', 'integer', 'categorical', 'integer', 'integer',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'integer',
                                              'categorical', 'categorical', 'categorical', 'categorical', 'categorical',
                                              'categorical', 'categorical', 'categorical'))
 my.data <- laf_to_ffdf(my.data.laf, nrows = 1000000)
 my.data.in.ram <- as.data.frame(my.data)

PS. I started using the LaF package because I was annoyed by the slowness of read.fwf, and because the PostgreSQL PL/SQL code that originally parsed the data was becoming a maintenance problem.

Sep 10 '13 at 18:37

Here is a pure R solution using the new readr package, created by Hadley Wickham and the RStudio team and released in April 2015. The code is as simple as this:

 library(readr)
 my.data.frame <- read_fwf('TS_MATRICULA_RS.txt',
                           fwf_widths(c(5, 13, 14, 3, 3, 5, 4, 6, 6, 6, 1, 1, 1, 4, 3, 2, 9, 3, 2, 9,
                                        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                        1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 4, 11, 9, 2, 3, 9, 3, 2, 9, 9,
                                        1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1)),
                           progress = interactive())

Benefits of read_fwf{readr}

  • readr is based on LaF, but surprisingly faster. It has been shown to be the fastest method for reading fixed-width files in R.
  • It is simpler than the alternatives: for example, you do not need to worry about column_types, because they are guessed from the first 30 lines of the input.
  • It comes with a progress bar ;)
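If the guessed types are ever wrong, you can still pass them explicitly. A minimal self-contained sketch (the compact string "iic" means integer, integer, character):

```r
library(readr)

ff <- tempfile()
writeLines(c("123456", "987654"), ff)

# Three fixed-width columns of widths 1, 2 and 3, with explicit types.
df <- read_fwf(ff, fwf_widths(c(1, 2, 3), col_names = c("a", "b", "c")),
               col_types = "iic")
df   # a = 1, 9; b = 23, 87; c = "456", "654"
unlink(ff)
```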
Sep 28 '15 at 11:30


