I am trying to tokenize the following input in Python:
text = 'This @ example@ is "neither":/defn/neither complete[1] *nor* trite, *though _simple_*.'
I would like to do something like the following, avoiding the use of regular expressions:
tokens = [
    ('text', 'This '),
    ('enter', 'code'), ('text', 'example'), ('exit', None),
    ('text', ' is '),
    ('enter', 'a'), ('text', 'neither'), ('href', '/defn/neither'), ('exit', None),
    ('text', ' complete'),
    ('enter', 'footnote'), ('id', 1), ('exit', None),
    ('text', ' '),
    ('enter', 'strong'), ('text', 'nor'), ('exit', None),
    ('text', ' trite, '),
    ('enter', 'strong'), ('text', 'though '),
    ('enter', 'em'), ('text', 'simple'), ('exit', None),
    ('exit', None),
    ('text', '.'),
]
Pretend that the above is produced by a generator. My current implementation works, but the code is somewhat ugly and does not extend easily to link support.
Any help would be greatly appreciated.
Update: I changed the desired output from a complex nested list structure to a simple stream of tuples; the indentation above is only for human readability. Formatting inside link text should still be processed. I now have a simple parser that produces the lexing result I am looking for, but it still does not handle links or footnotes.
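For what it's worth, here is a minimal sketch of a generator-based tokenizer for a subset of this syntax, with no regular expressions. The delimiter rules are my own assumptions (a span opens only when the marker is immediately followed by a non-space character, and closes only when it matches the innermost open marker), so it treats the `@ example@` fragment as literal text rather than a code span; everything else in the example above tokenizes as desired.

```python
def tokenize(text):
    """Yield ('enter', kind), ('text', s), ('href', url), ('id', n),
    and ('exit', None) tuples for a small Textile-like subset."""
    delims = {'@': 'code', '*': 'strong', '_': 'em'}
    stack = []   # markers of currently open spans, innermost last
    buf = []     # pending literal characters

    def flush():
        # Emit any buffered literal text as a single 'text' token.
        if buf:
            yield ('text', ''.join(buf))
            buf.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch in delims:
            if stack and stack[-1] == ch:
                # Closing the innermost open span.
                yield from flush()
                yield ('exit', None)
                stack.pop()
            elif i + 1 < len(text) and not text[i + 1].isspace():
                # Opening a span: marker must touch its content.
                yield from flush()
                yield ('enter', delims[ch])
                stack.append(ch)
            else:
                buf.append(ch)   # lone marker, keep as literal
            i += 1
        elif ch == '"':
            # "link text":/url -- scan for the closing quote-colon.
            end = text.find('":', i + 1)
            if end != -1:
                yield from flush()
                yield ('enter', 'a')
                yield ('text', text[i + 1:end])
                j = k = end + 2
                while k < len(text) and not text[k].isspace() and text[k] not in ',.;':
                    k += 1           # URL runs to whitespace or punctuation
                yield ('href', text[j:k])
                yield ('exit', None)
                i = k
            else:
                buf.append(ch)
                i += 1
        elif ch == '[':
            # [1] -- a numeric footnote reference.
            end = text.find(']', i)
            inner = text[i + 1:end] if end != -1 else ''
            if inner.isdigit():
                yield from flush()
                yield ('enter', 'footnote')
                yield ('id', int(inner))
                yield ('exit', None)
                i = end + 1
            else:
                buf.append(ch)
                i += 1
        else:
            buf.append(ch)
            i += 1
    yield from flush()
```

Because unmatched markers fall back to literal text, malformed input degrades gracefully instead of raising; a stricter variant could instead report an error when `stack` is non-empty at end of input.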