Parsing this raw text: which strategy should I use?

Question

I have this raw text:

    ________________________________________________________________________________________________________________________________
    Pos Car  Competitor/Team                Driver                   Vehicle              Cap   CL Laps     Race.Time Fastest...Lap
    1     6  Jason Clements                 Jason Clements           BMW M3               3200       10     9:48.5710   3 0:57.3228*
    2    42  David Skillender               David Skillender         Holden VS Commodore  6000       10     9:55.6866   2 0:57.9409
    3    37  Bruce Cook                     Bruce Cook               Ford Escort          3759       10     9:56.4388   4 0:58.3359
    4    18  Troy Marinelli                 Troy Marinelli           Nissan Silvia        3396       10     9:56.7758   2 0:58.4443
    5    75  Anthony Gilbertson             Anthony Gilbertson       BMW M3               3200       10    10:02.5842   3 0:58.9336
    6    26  Trent Purcell                  Trent Purcell            Mazda RX7            2354       10    10:07.6285   4 0:59.0546
    7    12  Scott Hunter                   Scott Hunter             Toyota Corolla       2000       10    10:11.3722   5 0:59.8921
    8    91  Graeme Wilkinson               Graeme Wilkinson         Ford Escort          2000       10    10:13.4114   5 1:00.2175
    9     7  Justin Wade                    Justin Wade              BMW M3               4000       10    10:18.2020   9 1:00.8969
    10   55  Greg Craig                     Grag Craig               Toyota Corolla       1840       10    10:18.9956   7 1:00.7905
    11   46  Kyle Orgam-Moore               Kyle Organ-Moore         Holden VS Commodore  6000       10    10:30.0179   3 1:01.6741
    12   39  Uptiles Strathpine             Trent Spencer            BMW Mini Cooper S    1500       10    10:40.1436   2 1:02.2728
    13  177  Mark Hyde                      Mark Hyde                Ford Escort          1993       10    10:49.5920   2 1:03.8069
    14   34  Peter Draheim                  Peter Draheim            Mazda RX3            2600       10    10:50.8159  10 1:03.4396
    15    5  Scott Douglas                  Scott Douglas            Datsun 1200          1998        9     9:48.7808   3 1:01.5371
    16   72  Paul Redman                    Paul Redman              Ford Focus           2lt         9    10:11.3707   2 1:05.8729
    17    8  Matthew Speakman               Matthew Speakman         Toyota Celica        1600        9    10:16.3159   3 1:05.9117
    18   74  Lucas Easton                   Lucas Easton             Toyota Celica        1600        9    10:16.8050   6 1:06.0748
    19   77  Dean Fuller                    Dean Fuller              Mitsubishi Sigma     2600        9    10:25.2877   3 1:07.3991
    20   16  Brett Batterby                 Brett Batterby           Toyota Corolla       1600        9    10:29.9127   4 1:07.8420
    21   95  Ross Hurford                   Ross Hurford             Toyota Corolla       1600        8     9:57.5297   2 1:12.2672
    DNF  13  Charles Wright                 Charles Wright           BMW 325i             2700        9     9:47.9888   7 1:03.2808
    DNF  20  Shane Satchwell                Shane Satchwell          Datsun 1200 Coupe    1998        1     1:05.9100   1 1:05.9100
    Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
    R=under lap record by greatest margin, r=under lap record, *=fastest lap time
    ________________________________________________________________________________________________________________________________
    Issue# 2 - Printed Sat May 26 15:43:31 2012 Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results Amended

I need to parse it into objects with explicit fields: Position, Car, Driver and so on. The problem is that I have no idea which strategy to use. If I split each line on spaces, I get a list like this:

 ["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]

You see the problem. I can't interpret this list by position alone, because a person may have a single name or three words in their name, and a vehicle may consist of any number of words. That makes it impossible to pick fields out of the list using indexes alone.

What about using the offsets defined by the column names? I can't quite see how that could be used, though.

Edit: The algorithm I currently use works as follows:

Split the text on newlines, giving a collection of lines.
Find the FURTHEST RIGHT whitespace common to every line, that is, the positions (indices) at which every other line also contains a space.
Split each line at these common positions.
Trim each resulting piece.
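
The steps above can be sketched roughly as follows. This is only an illustration of the idea, not the asker's actual code: the method name `split_on_common_blanks` is made up, and it treats every index that is blank in all lines as a separator.

```ruby
# Split fixed-width lines at the columns that are blank in every line.
# (Hypothetical helper name; a sketch of the algorithm described above.)
def split_on_common_blanks(lines)
  width  = lines.map(&:length).max
  padded = lines.map { |line| line.ljust(width) }

  # blank[i] is true when every line has a space at index i.
  blank = (0...width).map { |i| padded.all? { |line| line[i] == ' ' } }

  # Turn each run of non-blank indices into a half-open range.
  ranges = []
  start  = nil
  blank.each_with_index do |is_blank, i|
    if is_blank
      ranges << (start...i) if start
      start = nil
    else
      start ||= i
    end
  end
  ranges << (start...width) if start

  # Slice every line with the same ranges and trim each piece.
  padded.map { |line| ranges.map { |range| line[range].strip } }
end

same_length = ['Jason Adams', 'Bobby Sacka', 'Jerry Louis']
mixed       = ['Dominic Bou', 'Bob Adams', 'Jerry Seinfeld']

p split_on_common_blanks(same_length)
p split_on_common_blanks(mixed)
```

Running it on the two name lists discussed in the question reproduces the behaviour described there: equal-length first names are split into two columns, while names of differing lengths stay together.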

There are several problems:

If the names happen to have the same lengths:

    Jason Adams
    Bobby Sacka
    Jerry Louis

it will interpret this as two separate columns ( ["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"] ).

If they all differ, like this:

    Dominic Bou
    Bob Adams
    Jerry Seinfeld

it will correctly split after the final 'd' in Seinfeld, so we get a collection of three names ( ["Dominic Bou", "Bob Adams", "Jerry Seinfeld"] ).

It is also quite brittle. I am looking for a nicer solution.

+7

language-agnostic ruby text parsing screen-scraping

Dominic Bou-Samra May 28 '12 at 23:06

10 answers

This is not a good fit for a regex. You really want to work out the format and then unpack the lines:

    lines = str.split "\n"

    # you know the field names so you can use them to find the column positions
    fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
    header = lines.shift until header =~ /^Pos/
    positions = fields.map{|f| header.index f}

    # use that to construct an unpack format string;
    # the last field has no following column, so A* takes the rest of the line
    format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join + 'A*'
    # A4A5A31A25A21A6A12A10A*

    lines.each do |line|
      next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
      data = line.unpack(format).map{|x| x.strip}
      puts data.join(', ')
      # or better yet...
      car = Hash[fields.zip data]
      puts car['Driver']
    end

+6

pguardiario May 29 '12 at 5:54

Slither, a DSL for parsing fixed-width text files, may solve your problem: http://blog.ryanwood.com/past/2009/6/12/slither-a-dsl-for-parsing-fixed-width-text-files

Some examples and a GitHub repository are available as well.

Hope this helps!

+6

Bhushan lodha Jul 19 '12 at 9:02

I think it is simple enough to use fixed widths for each row.

    #!/usr/bin/env ruby
    # ruby parsing_winner.rb winners_list.txt

    args = ARGV
    puts "ruby parsing_winner.rb winners_list.txt " if args.empty?
    winner_file = open args.shift
    array_of_race_results, array_of_race_results_array = [], []

    class RaceResult
      attr_accessor :position, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest, :fastest_lap

      def initialize(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
        @position = position
        @car = car
        @team = team
        @driver = driver
        @vehicle = vehicle
        @cap = cap
        @cl_laps = cl_laps
        @race_time = race_time
        @fastest = fastest
        @fastest_lap = fastest_lap
      end

      def to_a
        # ["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
        [position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap]
      end
    end

    # Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
    # 1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
    # 2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
    # etc...

    winner_file.each_line do |line|
      # skip rulers, header, footer and blank lines
      next if line[/^____/] || line[/^\w{4,}|^\s|^Pos/] || line[0..3][/\=/]
      position = line[0..3].strip
      car = line[4..8].strip
      team = line[9..39].strip
      driver = line[40..64].strip
      vehicle = line[65..85].strip
      cap = line[86..91].strip
      cl_laps = line[92..101].strip
      race_time = line[102..113].strip
      fastest = line[114..116].strip
      fastest_lap = line[117..-1].strip
      racer = RaceResult.new(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
      array_of_race_results << racer
      array_of_race_results_array << racer.to_a
    end

    puts "Race Results Objects: #{array_of_race_results}"
    puts "Race Results: #{array_of_race_results_array.inspect}"

Output =>

 Race Results Objects: [#<RaceResult:0x007fcc4a84b7c8 @position="1", @car="6", @team="Jason Clements", @driver="Jason Clements", @vehicle="BMW M3", @cap="3200", @cl_laps="10", @race_time="9:48.5710", @fastest="3", @fastest_lap="0:57.3228*">, #<RaceResult:0x007fcc4a84aa08 @position="2", @car="42", @team="David Skillender", @driver="David Skillender", @vehicle="Holden VS Commodore", @cap="6000", @cl_laps="10", @race_time="9:55.6866", @fastest="2", @fastest_lap="0:57.9409">, #<RaceResult:0x007fcc4a849ce8 @position="3", @car="37", @team="Bruce Cook", @driver="Bruce Cook", @vehicle="Ford Escort", @cap="3759", @cl_laps="10", @race_time="9:56.4388", @fastest="4", @fastest_lap="0:58.3359">, #<RaceResult:0x007fcc4a8491f8 @position="4", @car="18", @team="Troy Marinelli", @driver="Troy Marinelli", @vehicle="Nissan Silvia", @cap="3396", @cl_laps="10", @race_time="9:56.7758", @fastest="2", @fastest_lap="0:58.4443">, #<RaceResult:0x007fcc4b091ab8 @position="5", @car="75", @team="Anthony Gilbertson", @driver="Anthony Gilbertson", @vehicle="BMW M3", @cap="3200", @cl_laps="10", @race_time="10:02.5842", @fastest="3", @fastest_lap="0:58.9336">, #<RaceResult:0x007fcc4b0916a8 @position="6", @car="26", @team="Trent Purcell", @driver="Trent Purcell", @vehicle="Mazda RX7", @cap="2354", @cl_laps="10", @race_time="10:07.6285", @fastest="4", @fastest_lap="0:59.0546">, #<RaceResult:0x007fcc4b091298 @position="7", @car="12", @team="Scott Hunter", @driver="Scott Hunter", @vehicle="Toyota Corolla", @cap="2000", @cl_laps="10", @race_time="10:11.3722", @fastest="5", @fastest_lap="0:59.8921">, #<RaceResult:0x007fcc4b090e88 @position="8", @car="91", @team="Graeme Wilkinson", @driver="Graeme Wilkinson", @vehicle="Ford Escort", @cap="2000", @cl_laps="10", @race_time="10:13.4114", @fastest="5", @fastest_lap="1:00.2175">, #<RaceResult:0x007fcc4b090a78 @position="9", @car="7", @team="Justin Wade", @driver="Justin Wade", @vehicle="BMW M3", @cap="4000", @cl_laps="10", @race_time="10:18.2020", @fastest="9", 
@fastest_lap="1:00.8969">, #<RaceResult:0x007fcc4b090668 @position="10", @car="55", @team="Greg Craig", @driver="Grag Craig", @vehicle="Toyota Corolla", @cap="1840", @cl_laps="10", @race_time="10:18.9956", @fastest="7", @fastest_lap="1:00.7905">, #<RaceResult:0x007fcc4b090258 @position="11", @car="46", @team="Kyle Orgam-Moore", @driver="Kyle Organ-Moore", @vehicle="Holden VS Commodore", @cap="6000", @cl_laps="10", @race_time="10:30.0179", @fastest="3", @fastest_lap="1:01.6741">, #<RaceResult:0x007fcc4b08fe48 @position="12", @car="39", @team="Uptiles Strathpine", @driver="Trent Spencer", @vehicle="BMW Mini Cooper S", @cap="1500", @cl_laps="10", @race_time="10:40.1436", @fastest="2", @fastest_lap="1:02.2728">, #<RaceResult:0x007fcc4b08fa38 @position="13", @car="177", @team="Mark Hyde", @driver="Mark Hyde", @vehicle="Ford Escort", @cap="1993", @cl_laps="10", @race_time="10:49.5920", @fastest="2", @fastest_lap="1:03.8069">, #<RaceResult:0x007fcc4b08f628 @position="14", @car="34", @team="Peter Draheim", @driver="Peter Draheim", @vehicle="Mazda RX3", @cap="2600", @cl_laps="10", @race_time="10:50.8159", @fastest="10", @fastest_lap="1:03.4396">, #<RaceResult:0x007fcc4b08f218 @position="15", @car="5", @team="Scott Douglas", @driver="Scott Douglas", @vehicle="Datsun 1200", @cap="1998", @cl_laps="9", @race_time="9:48.7808", @fastest="3", @fastest_lap="1:01.5371">, #<RaceResult:0x007fcc4b08ee08 @position="16", @car="72", @team="Paul Redman", @driver="Paul Redman", @vehicle="Ford Focus", @cap="2lt", @cl_laps="9", @race_time="10:11.3707", @fastest="2", @fastest_lap="1:05.8729">, #<RaceResult:0x007fcc4b08e9f8 @position="17", @car="8", @team="Matthew Speakman", @driver="Matthew Speakman", @vehicle="Toyota Celica", @cap="1600", @cl_laps="9", @race_time="10:16.3159", @fastest="3", @fastest_lap="1:05.9117">, #<RaceResult:0x007fcc4b08e5e8 @position="18", @car="74", @team="Lucas Easton", @driver="Lucas Easton", @vehicle="Toyota Celica", @cap="1600", @cl_laps="9", 
@race_time="10:16.8050", @fastest="6", @fastest_lap="1:06.0748">, #<RaceResult:0x007fcc4b08e1d8 @position="19", @car="77", @team="Dean Fuller", @driver="Dean Fuller", @vehicle="Mitsubishi Sigma", @cap="2600", @cl_laps="9", @race_time="10:25.2877", @fastest="3", @fastest_lap="1:07.3991">, #<RaceResult:0x007fcc4b08ddc8 @position="20", @car="16", @team="Brett Batterby", @driver="Brett Batterby", @vehicle="Toyota Corolla", @cap="1600", @cl_laps="9", @race_time="10:29.9127", @fastest="4", @fastest_lap="1:07.8420">, #<RaceResult:0x007fcc4a848348 @position="21", @car="95", @team="Ross Hurford", @driver="Ross Hurford", @vehicle="Toyota Corolla", @cap="1600", @cl_laps="8", @race_time="9:57.5297", @fastest="2", @fastest_lap="1:12.2672">, #<RaceResult:0x007fcc4a847948 @position="DNF", @car="13", @team="Charles Wright", @driver="Charles Wright", @vehicle="BMW 325i", @cap="2700", @cl_laps="9", @race_time="9:47.9888", @fastest="7", @fastest_lap="1:03.2808">, #<RaceResult:0x007fcc4a847010 @position="DNF", @car="20", @team="Shane Satchwell", @driver="Shane Satchwell", @vehicle="Datsun 1200 Coupe", @cap="1998", @cl_laps="1", @race_time="1:05.9100", @fastest="1", @fastest_lap="1:05.9100">] Race Results: [["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"], ["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2", "0:57.9409"], ["3", "37", "Bruce Cook", "Bruce Cook", "Ford Escort", "3759", "10", "9:56.4388", "4", "0:58.3359"], ["4", "18", "Troy Marinelli", "Troy Marinelli", "Nissan Silvia", "3396", "10", "9:56.7758", "2", "0:58.4443"], ["5", "75", "Anthony Gilbertson", "Anthony Gilbertson", "BMW M3", "3200", "10", "10:02.5842", "3", "0:58.9336"], ["6", "26", "Trent Purcell", "Trent Purcell", "Mazda RX7", "2354", "10", "10:07.6285", "4", "0:59.0546"], ["7", "12", "Scott Hunter", "Scott Hunter", "Toyota Corolla", "2000", "10", "10:11.3722", "5", "0:59.8921"], ["8", "91", "Graeme 
Wilkinson", "Graeme Wilkinson", "Ford Escort", "2000", "10", "10:13.4114", "5", "1:00.2175"], ["9", "7", "Justin Wade", "Justin Wade", "BMW M3", "4000", "10", "10:18.2020", "9", "1:00.8969"], ["10", "55", "Greg Craig", "Grag Craig", "Toyota Corolla", "1840", "10", "10:18.9956", "7", "1:00.7905"], ["11", "46", "Kyle Orgam-Moore", "Kyle Organ-Moore", "Holden VS Commodore", "6000", "10", "10:30.0179", "3", "1:01.6741"], ["12", "39", "Uptiles Strathpine", "Trent Spencer", "BMW Mini Cooper S", "1500", "10", "10:40.1436", "2", "1:02.2728"], ["13", "177", "Mark Hyde", "Mark Hyde", "Ford Escort", "1993", "10", "10:49.5920", "2", "1:03.8069"], ["14", "34", "Peter Draheim", "Peter Draheim", "Mazda RX3", "2600", "10", "10:50.8159", "10", "1:03.4396"], ["15", "5", "Scott Douglas", "Scott Douglas", "Datsun 1200", "1998", "9", "9:48.7808", "3", "1:01.5371"], ["16", "72", "Paul Redman", "Paul Redman", "Ford Focus", "2lt", "9", "10:11.3707", "2", "1:05.8729"], ["17", "8", "Matthew Speakman", "Matthew Speakman", "Toyota Celica", "1600", "9", "10:16.3159", "3", "1:05.9117"], ["18", "74", "Lucas Easton", "Lucas Easton", "Toyota Celica", "1600", "9", "10:16.8050", "6", "1:06.0748"], ["19", "77", "Dean Fuller", "Dean Fuller", "Mitsubishi Sigma", "2600", "9", "10:25.2877", "3", "1:07.3991"], ["20", "16", "Brett Batterby", "Brett Batterby", "Toyota Corolla", "1600", "9", "10:29.9127", "4", "1:07.8420"], ["21", "95", "Ross Hurford", "Ross Hurford", "Toyota Corolla", "1600", "8", "9:57.5297", "2", "1:12.2672"], ["DNF", "13", "Charles Wright", "Charles Wright", "BMW 325i", "2700", "9", "9:47.9888", "7", "1:03.2808"], ["DNF", "20", "Shane Satchwell", "Shane Satchwell", "Datsun 1200 Coupe", "1998", "1", "1:05.9100", "1", "1:05.9100"]]

+5

earlonrails Jul 22 '12 at 19:48

Depending on how consistent the formatting is, you can probably use a regex for this.

Here is an example regular expression that works for the current data. It may need tuning depending on the exact rules, but it gives the idea:

    ^
    # Pos
    (\d+|DNF)
    \s+
    # Car
    (\d+)
    \s+
    # Team
    ([\w-]+(?:[ ][\w-]+)+)
    \s+
    # Driver
    ([\w-]+(?:[ ][\w-]+)+)
    \s+
    # Vehicle
    ([\w-]+(?:[ ]?[\w-]+)+)
    \s+
    # Cap
    (\d{4}|\dlt)
    \s+
    # CL Laps
    (\d+)
    \s+
    # Race.Time
    (\d+:\d+\.\d+)
    \s+
    # Fastest Lap
    (\d+)
    \s+
    # Fastest Lap Time
    (\d+:\d+\.\d+\*?)
    \s*
    $
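
As a rough sketch of how such a pattern could be used from Ruby: the version below is written in extended (`/x`) mode, so literal spaces are spelled `[ ]`, and the group names (`pos`, `team`, and so on) are my own additions for readability, not part of the answer. The sample rows are typed by hand with at least two spaces between columns; the real column widths may differ.

```ruby
# Free-spacing regex with named groups (names are illustrative assumptions).
ROW = /
  ^
  (?<pos>\d+|DNF)                     \s+
  (?<car>\d+)                         \s+
  (?<team>[\w-]+(?:[ ][\w-]+)+)       \s+
  (?<driver>[\w-]+(?:[ ][\w-]+)+)     \s+
  (?<vehicle>[\w-]+(?:[ ]?[\w-]+)+)   \s+
  (?<cap>\d{4}|\dlt)                  \s+
  (?<laps>\d+)                        \s+
  (?<race_time>\d+:\d+\.\d+)          \s+
  (?<fast_lap_no>\d+)                 \s+
  (?<fast_lap_time>\d+:\d+\.\d+\*?)   \s*
  $
/x

rows = [
  '1     6  Jason Clements   Jason Clements   BMW M3   3200   10   9:48.5710   3 0:57.3228*',
  'DNF  20  Shane Satchwell   Shane Satchwell   Datsun 1200 Coupe   1998   1   1:05.9100   1 1:05.9100'
]

rows.each do |row|
  match = ROW.match(row)
  # named_captures returns a Hash of group name => captured text
  p match.named_captures if match
end
```

Note that this relies on the columns being separated by more than one space. On text where every gap has been collapsed to a single space, the multi-word groups become ambiguous and the match can assign words to the wrong fields.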

+4

Peter Boughton May 28 '12 at 23:18

If you can verify that the whitespace consists of space characters rather than tabs, and that overlong text is always truncated to fit the column structure, then I would hardcode the slice boundaries:

 parsed = [rawLine[0:3],rawLine[4:7],rawLine[9:38], ...etc... ]

Depending on the data source, this can be brittle (if, for example, each run of data has different column widths).

If the header line is always the same, you can extract the slice boundaries by searching for the known words in the header line.

+4

Russell Borogove May 28 '12 at 23:20

Ok, I gotchu:

Edit: I forgot to mention that this assumes you have saved the input text in the variable input_string.

    # Choose a delimiter that is unlikely to occur
    DELIM = '|||'

    # DRY -> extend String
    class String
      def split_on_spaces(min_spaces = 1)
        self.strip.gsub(/\s{#{min_spaces},}/, DELIM).split(DELIM)
      end
    end

    # just get the data lines
    lines = input_string.split("\n")
    lines = lines[2...(lines.length - 4)].delete_if { |line| line.empty? }

    # Grab all the entries into a nice 2-d array
    entries = lines.map { |line|
      [
        line[0..8].split_on_spaces,
        line[9..85].split_on_spaces(3).map{ |string|
          string.gsub(/\s+/, ' ') # replace whitespace with 1 space
        },
        line[85...line.length].split_on_spaces(2)
      ].flatten
    }

    # BONUS
    # Make nice hashes
    keys = [:pos, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest_lap]
    objects = entries.map { |entry|
      Hash[keys.zip entry]
    }

Outputs:

    entries
    # =>
    ["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3 0:57.3228*"]
    ["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2 0:57.9409"]
    ...
    # all of length 9, no extra spaces

And for when arrays just don't cut it:

    objects
    # =>
    {:pos=>"1", :car=>"6", :team=>"Jason Clements", :driver=>"Jason Clements", :vehicle=>"BMW M3", :cap=>"3200", :cl_laps=>"10", :race_time=>"9:48.5710", :fastest_lap=>"3 0:57.3228*"}
    {:pos=>"2", :car=>"42", :team=>"David Skillender", :driver=>"David Skillender", :vehicle=>"Holden VS Commodore", :cap=>"6000", :cl_laps=>"10", :race_time=>"9:55.6866", :fastest_lap=>"2 0:57.9409"}
    ...

I leave refactoring this into nice functions to you.

+4

Ajcodez Jul 20 '12 at 20:10

If there is no clear rule for separating columns, you cannot do this.

The approach you have is good if you know that each column value is correctly aligned under its column heading.

Another approach might be to group words separated by exactly one space (from the text you provided, I see that this rule also holds).
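
A minimal sketch of that second idea, assuming that every column gap is at least two spaces wide and that words inside a value are separated by exactly one space. The sample row is typed by hand for illustration, so its spacing is an assumption.

```ruby
row = '2    42  David Skillender   David Skillender   Holden VS Commodore  6000   10   9:55.6866   2 0:57.9409'

# Runs of two or more spaces separate columns; single spaces stay inside a value.
*fields, fastest = row.strip.split(/\s{2,}/)

# The lap number and lap time are only one space apart, so they arrive
# as one value and need a second split.
lap_no, lap_time = fastest.split(' ')

p fields
p [lap_no, lap_time]
```

The weak point is any row where two adjacent columns happen to be separated by a single space, since those values would be merged.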

+3

Luchian grigore May 28 '12 at 23:10

Assuming the text is always laid out in the same columns, you can slice each line by position and then strip the surrounding whitespace from each part. For example, in Python:

    pos = row[0:3].strip()
    car = row[4:7].strip()

etc. Alternatively, you can define a regex to capture each part:

    ([[:alnum:]]+)\s([[:digit:]]+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]* )+)\s

etc. (The exact syntax depends on your regular expression flavor.) Note that the regular expression for the vehicle must cope with embedded spaces.

+2

Joshua smith May 28 '12 at 23:21

I'm not going to code this, but one way that definitely works for the dataset above is to split each line on whitespace and then assign the elements as follows:

    someArray = array of strings that were split by white space

    Pos             = someArray[0]
    Car             = someArray[1]
    Competitor/Team = someArray[2] + " " + someArray[3]
    Driver          = someArray[4] + " " + someArray[5]
    Vehicle         = someArray[6] + " " + ... + " " + someArray[someArray.length - 6]
    Cap             = someArray[someArray.length - 5]
    CL Laps         = someArray[someArray.length - 4]
    Race.Time       = someArray[someArray.length - 3]
    Fastest...Lap   = someArray[someArray.length - 2] + " " + someArray[someArray.length - 1]

The vehicle part can be built with a for or while loop.
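
For what it's worth, the pseudocode above translates to Ruby fairly directly, with negative indices and a range join taking the place of the loop. The method name `parse_by_tokens` and the hash keys are made up for this sketch, and it inherits the pseudocode's assumption that team and driver are always exactly two words.

```ruby
# Index from the front for the fixed leading fields and from the back for
# the fixed trailing fields; whatever is left in the middle is the vehicle.
def parse_by_tokens(line)
  t = line.split
  {
    pos:         t[0],
    car:         t[1],
    team:        t[2..3].join(' '),
    driver:      t[4..5].join(' '),
    vehicle:     t[6..-6].join(' '),
    cap:         t[-5],
    laps:        t[-4],
    race_time:   t[-3],
    fastest_lap: t[-2..-1].join(' ')
  }
end

p parse_by_tokens('12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728')
```

A competitor with one name, or three, would shift every later field, which is exactly the concern raised in the question.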

+1

Tom prats Jul 25 '12 at 4:19

Mark thomas · Accepted Answer · 2012-07-20T01:41:05+0000

You can use the fixed_width gem.

Your file can be parsed with the following code:

    require 'fixed_width'
    require 'pp'

    FixedWidth.define :cars do |d|
      d.head do |head|
        head.trap { |line| line !~ /\d/ }
      end
      d.body do |body|
        body.trap { |line| line =~ /^(\d|DNF)/ }
        body.column :pos, 4
        body.column :car, 5
        body.column :competitor, 31
        body.column :driver, 25
        body.column :vehicle, 21
        body.column :cap, 5
        body.column :cl_laps, 11
        body.column :race_time, 11
        body.column :fast_lap_no, 4
        body.column :fast_lap_time, 10
      end
    end

    pp FixedWidth.parse(File.open("races.txt"), :cars)

The trap method identifies which lines belong to each section. I used regexes:

The head regex looks for lines that do not contain a digit.
The body regex looks for lines that start with a digit or "DNF".

Each section must start on the line immediately after the previous one. The column definitions simply state how many characters to capture for each column. The library strips the whitespace for you. If you wanted to produce a fixed-width file you could add alignment options, but it does not look like you will need that.

The result is a hash that starts as follows:

    {:head=>[{}, {}, {}],
     :body=>
      [{:pos=>"1", :car=>"6", :competitor=>"Jason Clements", :driver=>"Jason Clements",
        :vehicle=>"BMW M3", :cap=>"3200", :cl_laps=>"10", :race_time=>"9:48.5710",
        :fast_lap_no=>"3", :fast_lap_time=>"0:57.3228"},
       {:pos=>"2", :car=>"42", :competitor=>"David Skillender", :driver=>"David Skillender",
        :vehicle=>"Holden VS Commodore", :cap=>"6000", :cl_laps=>"10", :race_time=>"9:55.6866",
        :fast_lap_no=>"2", :fast_lap_time=>"0:57.9409"},
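
For comparison, here is a sketch of the same idea without the gem, using only `String#unpack` from core Ruby. It swaps in a different technique from the one in this answer: the column names and widths are copied from the definition above, while `COLUMNS`, `FORMAT` and `parse_row` are names invented for the example. The sample row is built with `format` so the padding is exact, and the alignment chosen for each column is an assumption.

```ruby
# Column names and widths, copied from the FixedWidth definition.
COLUMNS = {
  pos: 4, car: 5, competitor: 31, driver: 25, vehicle: 21,
  cap: 5, cl_laps: 11, race_time: 11, fast_lap_no: 4, fast_lap_time: 10
}.freeze

# Builds "A4A5A31..."; each A<n> reads n bytes and drops trailing spaces.
FORMAT = COLUMNS.values.map { |width| "A#{width}" }.join

def parse_row(line)
  # strip also removes the leading padding of right-aligned values
  values = line.unpack(FORMAT).map(&:strip)
  COLUMNS.keys.zip(values).to_h
end

# Illustrative row; the left/right alignment per column is assumed.
row = format('%-4s%3s  %-31s%-25s%-21s%-6s%7s%14s%4s %s',
             '1', '6', 'Jason Clements', 'Jason Clements', 'BMW M3',
             '3200', '10', '9:48.5710', '3', '0:57.3228')

p parse_row(row)
```

Because `unpack` counts bytes, this sketch only suits ASCII input; the gem remains the more convenient choice when header and footer sections need handling too.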
