Designing a module for parsing a text file

I no longer believe in a general-purpose parser for text files, especially when the files are meant for human readers. Files such as HTML and web logs can be handled well with Beautiful Soup or regular expressions. But human-readable text files are still a tough nut to crack.

That said, I am ready to hand-code a text file parser and adapt it to each new format I come across. I would still like to find a better program structure, so that the code stays readable and I can still follow its logic three months from now.

Today I was asked to extract timestamps from a file containing lines like these:

"As of 12:30:45, ..."
"Between 1:12:00 and 3:10:45, ..."
"During this time from 3:44:50 to 4:20:55 we have ..."

The parsing itself is simple: the timestamps just sit at different positions in each line. What I am wondering is how to design the module/functions so that (1) each line format is handled by its own function, and (2) there is a clean way to dispatch to the right function. For example, I can write one parser per line format:

def parse_as(s):
    # "As of 12:30:45, ..." has a single timestamp, so return it as both start and end
    ts = s.split(' ')[2].rstrip(',')
    return ts, ts

def parse_between(s):
    # "Between 1:12:00 and 3:10:45, ..."
    words = s.split(' ')
    return words[1], words[3].rstrip(',')

def parse_during(s):
    # "During this time from 3:44:50 to 4:20:55 we have ..."
    words = s.split(' ')
    return words[4], words[6]

This lets me see at a glance which formats the program already handles, and I can always add a new function when I come across a new format.

However, I still do not have an elegant way to dispatch to the right function.

# open file
for l in f:
    s = l.split(' ')
    if s[0] == 'As':
        ts1, ts2 = parse_as(l)
    else:
        if s[0] == 'Between':
            ts1, ts2 = parse_between(l)
        else:
            if s[0] == 'During':
                ts1, ts2 = parse_during(l)
            else:
                print('error!')
                continue  # nothing to process for an unknown format
    # process ts1 and ts2

This is not something I want to maintain.

Any suggestions? I once thought a decorator might help, but I could not work out how on my own. I would appreciate it if anyone could point me in the right direction.

+4
3

You can use a dictionary that maps the first word of each line to its parser function:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}

The main loop then becomes a dictionary lookup:

for l in f:
    s = l.split(' ')
    p = dmap.get(s[0], None)  # look up the parser by the first word
    if p is None:
        print('error')
    else:
        ts1, ts2 = p(l)
        #continue to process

This is easy to extend. To support a new format, just add another entry to dmap:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during,
    'After': parse_after,
    'Before': parse_before
    #and so on
}
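
The decorator mentioned in the question could also be used to fill dmap, so that each parser registers itself right where it is defined. This is only a minimal sketch: `register` and `parse_line` are hypothetical names introduced for illustration, and the word positions assume the three sample lines from the question.

```python
dmap = {}

def register(keyword):
    """Return a decorator that stores the parser in dmap under keyword."""
    def decorator(func):
        dmap[keyword] = func
        return func
    return decorator

@register('As')
def parse_as(s):
    # "As of 12:30:45, ..." -> a single timestamp, returned twice
    ts = s.split(' ')[2].rstrip(',')
    return ts, ts

@register('Between')
def parse_between(s):
    # "Between 1:12:00 and 3:10:45, ..."
    words = s.split(' ')
    return words[1], words[3].rstrip(',')

@register('During')
def parse_during(s):
    # "During this time from 3:44:50 to 4:20:55 we have ..."
    words = s.split(' ')
    return words[4], words[6]

def parse_line(line):
    """Dispatch on the first word; return None for an unknown format."""
    parser = dmap.get(line.split(' ')[0])
    return parser(line) if parser else None
```

With this layout, adding a format means writing one decorated function; the dictionary itself never has to be edited by hand.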
+3

start_with = ["As", "Between", "During"]
parsers = [parse_as, parse_between, parse_during]


for l in f.readlines():
    match_found = False

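    # use a name other than f so the file handle is not shadowed
    for start, parser in zip(start_with, parsers):
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers expect the whole line, not a list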
            match_found = True
            break

    if not match_found:
        raise NotImplementedError('Not found!')

Or with a dict, as Ian suggests:

rules = {
    "As": parse_as,
    "Between": parse_between,
    "During": parse_during
}

for l in f.readlines():
    match_found = False

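    # use a name other than f so the file handle is not shadowed
    for start, parser in rules.items():
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers expect the whole line, not a list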
            match_found = True
            break

    if not match_found:
        raise NotImplementedError('Not found!')
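
The match_found flag can be avoided with Python's for/else, where the else branch runs only if the loop finished without a break. A minimal sketch, assuming a hypothetical helper named `find_parser` and stand-in parsers that exist only to keep the example self-contained:

```python
# Stand-in parsers for illustration only; the real ones come from the question.
def parse_as(s):
    return 'as'

def parse_between(s):
    return 'between'

def parse_during(s):
    return 'during'

rules = {
    "As": parse_as,
    "Between": parse_between,
    "During": parse_during,
}

def find_parser(line, rules):
    """Return the parser whose keyword starts the line."""
    for start, parser in rules.items():
        if line.startswith(start):
            break
    else:
        # reached only when no keyword matched (the loop never hit break)
        raise NotImplementedError('Not found: %r' % line)
    return parser
```

Calling `find_parser(l, rules)(l)` inside the file loop then yields the timestamps or raises for an unknown format.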
+1

Why not use regex?

import re

# open file
with open('datafile.txt') as f:
    for line in f:
        ts_vals = re.findall(r'(\d+:\d\d:\d\d)', line)
        # process ts_vals (ts1 and, if present, ts2)

ts_vals will then be a list with one or two items for the examples provided.
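
To get the same (ts1, ts2) pair as the per-format parsers, the one-item case can reuse the single timestamp for both values. A minimal sketch, where `extract_timestamps` and `TS_PATTERN` are hypothetical names and the pattern assumes timestamps shaped like those in the sample lines:

```python
import re

# one or more hour digits, then :MM:SS
TS_PATTERN = re.compile(r'\d+:\d\d:\d\d')

def extract_timestamps(line):
    """Return (ts1, ts2), or None if the line holds no timestamp."""
    ts_vals = TS_PATTERN.findall(line)
    if not ts_vals:
        return None
    # with a single timestamp, first and last are the same item
    return ts_vals[0], ts_vals[-1]
```

This trades the per-format functions for a single pattern, which works as long as only the timestamps matter and not where they appear in the line.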

0