Tokenizing Complex Input

Question


I am trying to tokenize the following input in Python:

text = 'This @example@ is "neither":/defn/neither complete[1] *nor* trite, *though _simple_*.'

I would like to produce something like the following, without using regular expressions:

    tokens = [
        ('text', 'This '),
        ('enter', 'code'), ('text', "example"), ('exit', None),
        ('text', ' is '),
        ('enter', 'a'), ('text', "neither"), ('href', "/defn/neither"), ('exit', None),
        ('text', ' complete'),
        ('enter', 'footnote'), ('id', 1), ('exit', None),
        ('text', ' '),
        ('enter', 'strong'), ('text', 'nor'), ('exit', None),
        ('text', ' trite, '),
        ('enter', 'strong'), ('text', 'though '),
        ('enter', 'em'), ('text', 'simple'), ('exit', None),
        ('exit', None),
        ('text', '.'),
    ]

Pretend that the above is being produced by a generator. My current implementation works, but the code is rather ugly and does not extend easily to support links.
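As a rough illustration only (this is not the implementation referred to above), a regex-free generator for the simplest part of the problem could look like the sketch below. It assumes single-character delimiters that both open and close an element, that everything inside `@...@` is literal, and that markup is properly nested. The `MARKERS` table and the `tokenize` name are hypothetical, and links and footnotes are deliberately left out.

```python
# Hypothetical marker table: delimiter character -> element name.
MARKERS = {'@': 'code', '*': 'strong', '_': 'em'}


def tokenize(text):
    """Yield (kind, value) tuples for a small subset of the markup."""
    stack = []  # names of the currently open elements, innermost last
    buf = []    # characters of the pending text run

    for ch in text:
        name = MARKERS.get(ch)
        inside_code = bool(stack) and stack[-1] == 'code'

        # Ordinary characters, and markers inside a code span, are plain text.
        if name is None or (inside_code and name != 'code'):
            buf.append(ch)
            continue

        # A marker ends the pending text run, if there is one.
        if buf:
            yield ('text', ''.join(buf))
            buf = []

        # The same marker closes the innermost element; otherwise it opens one.
        if stack and stack[-1] == name:
            stack.pop()
            yield ('exit', None)
        else:
            stack.append(name)
            yield ('enter', name)

    if buf:
        yield ('text', ''.join(buf))


for token in tokenize('This @example@ is *nor* trite, *though _simple_*.'):
    print(token)
```

Because the sketch treats every `@`, `*` and `_` outside a code span as markup, a literal asterisk or underscore in running text, or badly nested markup, will produce a stream with unbalanced `enter`/`exit` pairs. A real tokenizer would need extra rules (for example, checking the neighbouring characters) to handle those cases.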

Any help would be greatly appreciated.

Update: changed the desired output from a complex nested list structure to a simple stream of tuples, indented for us humans. Formatting inside the link text is fine. Here is a simple parser that produces the lexing result I am looking for, but it still does not handle links or footnotes.


python stream text tokenize parsing

amcgregor Aug 22 '11 at 3:41


1 answer

amcgregor · Accepted Answer · 2011-08-23T06:28:45+0000

Well, here's a more complete parser with enough extensibility to do what I might need in the future. It only took three hours. It is not terribly fast, but the output of the parser class I am writing is heavily cached anyway. Even with this tokenizer and parser in place, my full engine still comes in at under 75% of the SLoC of the default python-textile renderer, while remaining somewhat faster. All without regular expressions.

Footnote parsing remains, but that is minor compared to link parsing. Output (as of this post):

    tokens = [
        ('text', 'This '),
        ('enter', 'code'), ('text', 'example'), ('exit', None),
        ('text', ' is '),
        ('enter', 'a'), ('text', 'neither'), ('attr', ('href', '/defn/neither')), ('exit', None),
        ('text', ' complete[1] '),
        ('enter', 'strong'), ('text', 'nor'), ('exit', None),
        ('text', ' trite, '),
        ('enter', 'strong'), ('text', 'though '),
        ('enter', 'em'), ('text', 'simple'), ('exit', None),
        ('exit', None),
        ('text', '.'),
    ]
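A stream in this shape can be consumed with a small stack. The sketch below is only an illustration of one way to do that, not part of the parser described in this answer: the `render` function is hypothetical, and it assumes the element names map directly onto HTML tag names. Since an `attr` token may arrive after the element's text (as with the link here), each element is buffered until its `exit` token is seen.

```python
from html import escape


def render(tokens):
    """Turn a balanced stream of (kind, value) tuples into an HTML string."""
    # Each frame is [tag name, attribute pairs, rendered children].
    # The root frame (name None) collects the top-level output.
    frames = [[None, [], []]]

    for kind, value in tokens:
        if kind == 'text':
            frames[-1][2].append(escape(value, quote=False))
        elif kind == 'enter':
            frames.append([value, [], []])
        elif kind == 'attr':
            # value is a (name, value) pair belonging to the innermost element
            frames[-1][1].append(value)
        elif kind == 'exit':
            name, attrs, children = frames.pop()
            attr_text = ''.join(
                ' %s="%s"' % (key, escape(str(val))) for key, val in attrs
            )
            frames[-1][2].append(
                '<%s%s>%s</%s>' % (name, attr_text, ''.join(children), name)
            )

    return ''.join(frames[0][2])
```

Calling `render(tokens)` on the list above should wrap the link text in an `a` element carrying the `href` attribute and nest `em` inside the second `strong`. The sketch does no validation, so a stray `exit` token or an element left open at the end of the stream is not handled gracefully.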
