Reading a fixed-width text file in F#

Hi, I am looking for a better way to read in a fixed-width text file using F#. The file will be plain text, from one to several thousand lines long and about 1000 characters wide. Each row contains about 50 fields, each with a different length. My initial thoughts were as follows:

```
type MyRecord =
    { Name : string
      Address : string
      Postcode : string
      Tel : string }

let format = [ (0, 10); (10, 50); (50, 7); (57, 20) ]
```

and then to read the file line by line, extracting each field according to its entry in format (where the first element is the start position and the second is the number of characters).

Any pointers would be appreciated.

+5
3 answers

The hardest part is probably splitting a single line according to the column format. That can be done something like this:

```
let splitLine format (line : string) =
    format
    |> List.map (fun (index, length) -> line.Substring(index, length))
```

This function has the type (int * int) list -> string -> string list. In other words, format is an (int * int) list, which matches your format list exactly. The line argument is a string, and the function returns a string list.
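For example, applied to a single line (a minimal sketch; the smaller three-column layout and the sample values below are made-up assumptions, not taken from the question):

```
// a hypothetical three-column layout: Name (10), Address (20), Postcode (7)
let smallFormat = [ (0, 10); (10, 20); (30, 7) ]

// a sample line padded out to the full 37 characters
let sampleLine =
    "John Smith".PadRight(10) + "1 High Street".PadRight(20) + "AB1 2CD"

splitLine smallFormat sampleLine
// val it : string list =
//   ["John Smith"; "1 High Street       "; "AB1 2CD"]
```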

You can then map this over a list of lines:

```
let result = lines |> List.map (splitLine format)
```

You can also use Seq.map or Array.map, depending on how lines is defined. The result will be a string list list, which you can then map over once more to create a MyRecord list.

You can use File.ReadLines to get a lazily evaluated sequence of lines from a file.
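Putting these pieces together, a minimal end-to-end sketch might look like this (the toMyRecord helper, the Trim calls, and the failure case are my own additions, and they assume the tuples in format line up with the record fields in order):

```
open System.IO

// convert one split line into a record; Trim removes the fixed-width padding
let toMyRecord (fields : string list) =
    match fields with
    | [ name; address; postcode; tel ] ->
        { Name = name.Trim()
          Address = address.Trim()
          Postcode = postcode.Trim()
          Tel = tel.Trim() }
    | _ -> failwith "unexpected number of fields"

let readRecords path =
    File.ReadLines path               // lazily evaluated sequence of lines
    |> Seq.map (splitLine format)     // one string list per line
    |> Seq.map toMyRecord             // one MyRecord per line
    |> Seq.toList
```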

Please note that the above is only a sketch of a possible solution. I have left out bounds checks, error handling, and so on, and the code may well contain off-by-one errors.

+3

A record with 50 fields is a bit unwieldy, so alternative approaches that let you build the data structure dynamically may be preferable (for example, System.Data.DataRow).

If it has to be a record anyway, you can at least spare yourself the manual assignment of each record field and fill the record via reflection instead. This trick relies on the order of the fields as they are defined. I assume that each fixed-width column corresponds to a record field, so the start indexes are implied.

```
open Microsoft.FSharp.Reflection

type MyRecord =
    { Name : string
      Address : string
      City : string
      Postcode : string
      Tel : string }
    with
        static member CreateFromFixedWidth format (line : string) =
            let fields =
                format
                |> List.fold (fun (index, acc) length ->
                    let str = line.[index .. index + length - 1].Trim()
                    index + length, box str :: acc) (0, [])
                |> snd
                |> List.rev
                |> List.toArray
            FSharpValue.MakeRecord(typeof<MyRecord>, fields) :?> MyRecord
```

Sample data:

 "Postman Pat " + "Farringdon Road " + "London " + "EC1A 1BB" + "+44 20 7946 0813" |> MyRecord.CreateFromFixedWidth [16; 16; 16; 8; 16] // val it : MyRecord = {Name = "Postman Pat"; // Address = "Farringdon Road"; // City = "London"; // Postcode = "EC1A 1BB"; // Tel = "+44 20 7946 0813";} 
+1

Here's a solution with an emphasis on custom validation and error handling for each field. This may be overkill for a data file consisting only of numeric data!

First, for tasks like this I like to use the parser in Microsoft.VisualBasic.dll, since it is already available without needing NuGet.

For each row, we return the line number (useful for error messages) and the array of fields:

 #r "Microsoft.VisualBasic.dll" // for each row, return the line number and the fields let parserReadAllFields fieldWidths textReader = let parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader=textReader) parser.SetFieldWidths fieldWidths parser.TextFieldType <- Microsoft.VisualBasic.FileIO.FieldType.FixedWidth seq {while not parser.EndOfData do yield parser.LineNumber,parser.ReadFields() } 

Next, we need a small error-handling library (for more details see http://fsharpforfunandprofit.com/rop/).

```
type Result<'a> =
    | Success of 'a
    | Failure of string list

module Result =

    let succeedR x = Success x

    let failR err = Failure [err]

    let mapR f xR =
        match xR with
        | Success a -> Success (f a)
        | Failure errs -> Failure errs

    let applyR fR xR =
        match fR, xR with
        | Success f, Success x -> Success (f x)
        | Failure errs, Success _ -> Failure errs
        | Success _, Failure errs -> Failure errs
        | Failure errs1, Failure errs2 -> Failure (errs1 @ errs2)
```
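To get a feel for how these combinators compose before they are used below, here is a tiny, made-up example (the add function and the error strings are purely illustrative):

```
open Result

// the same operators that fieldsToRecord defines below
let (<!>) f xR = mapR f xR
let (<*>) fR xR = applyR fR xR

let add x y = x + y

let ok   = add <!> succeedR 1 <*> succeedR 2          // Success 3
let bad1 = add <!> failR "bad x" <*> succeedR 2       // Failure ["bad x"]
let bad2 = add <!> failR "bad x" <*> failR "bad y"    // Failure ["bad x"; "bad y"]
```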

Then define your domain model. In this case, it is a record type with a field for each field in the file.

```
type MyRecord = {id:int; name:string; description:string}
```

Then you can define your domain-specific parsing code. For each field, I created a validation function (validateId, validateName, etc.). Fields that do not need validation can pass the raw data through (validateDescription).

In fieldsToRecord the individual fields are combined using the applicative style (<!> and <*>). See http://fsharpforfunandprofit.com/posts/elevated-world-3/#validation for more on this.

Finally, readRecords maps each line of input to a Result record and keeps only the successful ones. Bad records are reported by handleResult.

```
module MyFileParser =
    open Result

    let createRecord id name description =
        {id=id; name=name; description=description}

    let validateId (lineNo:int64) (fields:string[]) =
        let rawId = fields.[0]
        match System.Int32.TryParse(rawId) with
        | true, id -> succeedR id
        | false, _ -> failR (sprintf "[%i] Can't parse id '%s'" lineNo rawId)

    let validateName (lineNo:int64) (fields:string[]) =
        let rawName = fields.[1]
        if System.String.IsNullOrWhiteSpace rawName
        then failR (sprintf "[%i] Name cannot be blank" lineNo)
        else succeedR rawName

    let validateDescription (lineNo:int64) (fields:string[]) =
        let rawDescription = fields.[2]
        succeedR rawDescription // no validation

    let fieldsToRecord (lineNo, fields) =
        let (<!>) = mapR
        let (<*>) = applyR

        let validatedId = validateId lineNo fields
        let validatedName = validateName lineNo fields
        let validatedDescription = validateDescription lineNo fields

        createRecord <!> validatedId <*> validatedName <*> validatedDescription

    /// print any errors and only return good results
    let handleResult result =
        match result with
        | Success record -> Some record
        | Failure errs -> printfn "ERRORS %A" errs; None

    /// return a sequence of records
    let readRecords parserOutput =
        parserOutput
        |> Seq.map fieldsToRecord
        |> Seq.choose handleResult
```

Here is an example of the parsing in practice:

```
// Set up some sample text
let text = """01name1description1
02name2description2
xxname3badid-------
yy     badidandname"""

// create a low-level parser
let textReader = new System.IO.StringReader(text)
let fieldWidths = [| 2; 5; 11 |]
let parserOutput = parserReadAllFields fieldWidths textReader

// convert to records in my domain
let records =
    parserOutput
    |> MyFileParser.readRecords
    |> Seq.iter (printfn "RECORD %A") // print each record
```

The result will look like this:

```
RECORD {id = 1; name = "name1"; description = "description";}
RECORD {id = 2; name = "name2"; description = "description";}
ERRORS ["[3] Can't parse id 'xx'"]
ERRORS ["[4] Can't parse id 'yy'"; "[4] Name cannot be blank"]
```

This is by no means the most efficient way to parse a file (there are CSV-parsing libraries available on NuGet that can do validation while parsing), but it shows how you can have full control over validation and error handling if you need it.

+1
