Reading aligned column data with fread

I came across a file like this:

    COL1         COL2     COL3
    weqw         asrg     qerhqetjw
    weweg        ethweth  rqerhwrtjw
    rhqerhqerhq  qergqer  qerhqew5h
    qerh         qergqer  wetjwryerj

I could not load it with fread, so I used sed to replace \s+ with a comma and then passed the result to fread, which read it without problems. But is there a built-in way to read such data using data.table?

+5
2 answers

fread is not yet able to read fixed-width files (FWF).

I also often come across files annoyingly stored like that. Feel free to add a feature request on the GitHub page.

It may not matter in your case, but your sed solution will not work on many of the FWFs that I come across, because they have no whitespace between the columns. For example, you will see lines like 00010 that actually contain 3 fields.

In this case, you will need a field width dictionary, after which you have several options:

  • read.fwf in R
  • Write a program that converts FWF to CSV (I use one I wrote in Python, and it is pretty quick; I could share the code if you want). This is basically an extended version of your original approach, so that you no longer have to deal with the FWF.
  • Open it in Excel / LibreOffice / etc. These have a built-in FWF reader that tries (usually badly) to guess the column widths, which at least does half the work of specifying the column widths for you. Then you can save the result as .csv or similar from there.
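
The second option can be sketched roughly as follows. This is not the answerer's script, only a minimal Python illustration of slicing each line by a list of field widths and writing CSV. The function name fwf_to_csv is hypothetical, and the widths (3, 1, 1) for a line like 00010 are an assumption, since the answer only says that such a line contains 3 fields.

```python
import csv
import io


def fwf_to_csv(lines, widths, out):
    """Slice each line by the given column widths and write CSV rows to `out`."""
    # Turn the widths into (start, end) character offsets.
    bounds = []
    start = 0
    for w in widths:
        bounds.append((start, start + w))
        start += w

    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Cut out each field and trim any padding around it.
        writer.writerow([line[a:b].strip() for a, b in bounds])


# Assumed layout: three fields of widths 3, 1 and 1 with no separators.
sample = ["00010\n", "12345\n"]
buf = io.StringIO()
fwf_to_csv(sample, [3, 1, 1], buf)
print(buf.getvalue())
```

For a real file, `lines` could be an open file handle and `out` a file opened for writing with newline=""; the resulting CSV can then be passed to fread as usual.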

I personally use the second option most often. read.fwf is not optimized like fread, so it will probably be slow. And if you have a lot (say, 20+) of FWFs to read, the third option is quite tedious.

But I agree that it would be nice to have something like this built into fread.

+3

Fixed in the current devel (v1.9.5). Update and test (and report if there are problems).

    require(data.table) # v1.9.5+
    fread("~/Downloads/tmp.txt")
    #           COL1    COL2       COL3
    # 1:        weqw    asrg  qerhqetjw
    # 2:       weweg ethweth rqerhwrtjw
    # 3: rhqerhqerhq qergqer  qerhqew5h
    # 4:        qerh qergqer wetjwryerj

fread() has gained a strip.white argument (default TRUE), among other new arguments. Please check the README on the project page for the latest news.

+1
