I no longer believe in a general-purpose parser for text files, especially when the files are meant for human readers. Files such as HTML and web logs can be handled with Beautiful Soup or regular expressions, but human-readable text files are still a tough nut to crack.
I am prepared to hand-code a parser for each text file, adapting it to every new format I come across. Still, I would like to see whether a better program structure is possible, so that I can still understand the program logic three months down the road, and so that the code stays readable.
Today I was asked to extract timestamps from a file:
"As of 12:30:45, ..."
"Between 1:12:00 and 3:10:45, ..."
"During this time from 3:44:50 to 4:20:55 we have ..."
Parsing each line is simple: the timestamps just sit at different positions on each line. What I am wondering is how to design the module/functions so that (1) each line format is handled by its own function, and (2) there is a clean way to dispatch each line to the matching function. For example, I can write one parser per line format as follows:
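def parse_as(s):
    ts = s.split(' ')[2].rstrip(',')  # "As of 12:30:45, ..."
    return ts, ts  # only one timestamp, so start and end are the same

def parse_between(s):
    words = s.split(' ')  # "Between 1:12:00 and 3:10:45, ..."
    return words[1], words[3].rstrip(',')

def parse_during(s):
    words = s.split(' ')  # "During this time from 3:44:50 to 4:20:55 ..."
    return words[4], words[6]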
This lets me see at a glance which formats the program already handles, and I can always add a new function when I come across another format.
However, I still do not have an elegant way to dispatch each line to the corresponding function:
for l in f:  # iterate over the lines of the open file
    s = l.split(' ')
    # Pick the parser based on the first word of the line.
    if s[0] == 'As':
        ts1, ts2 = parse_as(l)
    elif s[0] == 'Between':
        ts1, ts2 = parse_between(l)
    elif s[0] == 'During':
        ts1, ts2 = parse_during(l)
    else:
        print('error!')
This is not something I want to maintain.
Any suggestions? At one point I thought a decorator might help, but I could not figure it out myself. I would appreciate it if anyone could point me in the right direction.
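To make the decorator idea concrete, here is a minimal sketch of one direction it could take. It is not a definitive answer. It assumes the first word of each line identifies its format. The names `PARSERS`, `handles`, `_clean`, and `parse_line` are hypothetical and introduced only for illustration.

```python
# Hypothetical registry: maps the first word of a line to its parser.
PARSERS = {}

def handles(keyword):
    """Decorator that registers a parser under a line's first word."""
    def register(func):
        PARSERS[keyword] = func
        return func
    return register

def _clean(token):
    # Drop trailing punctuation such as the comma in "12:30:45,".
    return token.rstrip(',.')

@handles('As')
def parse_as(s):
    ts = _clean(s.split()[2])
    return ts, ts  # single timestamp: start and end coincide

@handles('Between')
def parse_between(s):
    words = s.split()
    return _clean(words[1]), _clean(words[3])

@handles('During')
def parse_during(s):
    words = s.split()
    return _clean(words[4]), _clean(words[6])

def parse_line(line):
    """Look up the parser by the first word; raise on unknown formats."""
    words = line.split()
    if not words or words[0] not in PARSERS:
        raise ValueError('unrecognized line format: %r' % line)
    return PARSERS[words[0]](line)
```

With something like this, supporting a new format would mean adding one decorated function, and the loop over the file would reduce to a call such as `ts1, ts2 = parse_line(l)`.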