I am writing an interpreter for the scripting language of a long-outdated text editor, and I am having trouble getting the lexer to work correctly.
Here is an example of the problematic part of the language:
T
L /LOCATE ME/
C /LOCATE ME/CHANGED ME/ * *
C ;CHANGED ME;CHANGED ME AGAIN; 1 *
The / characters seem to delimit strings, and they also act as the separator for the C (CHANGE) command in a sed-style syntax, although the language lets you use any character as the separator.
I have implemented roughly half of the most common commands so far, just using parse_tokens(line.split()). It was quick and dirty, but it worked surprisingly well.
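To make the delimiter rule concrete, here is a minimal sketch. It assumes the delimiter is simply the first non-blank character of the argument text, and that the last occurrence of that character closes the delimited part. The helper name `split_delimited` is hypothetical; it is not part of the editor or of any library.

```python
def split_delimited(args):
    """Split '/a/b/ rest' into (['a', 'b'], 'rest').

    Assumptions: the first non-blank character is the delimiter, and the
    last occurrence of that character ends the delimited part.
    """
    args = args.strip()
    if not args:
        return [], ""
    delim = args[0]
    end = args.rfind(delim)
    if end == 0:
        raise ValueError("unterminated delimited string: %r" % args)
    fields = args[1:end].split(delim)   # text between the delimiters
    rest = args[end + 1:].strip()       # whatever follows the closing delimiter
    return fields, rest


print(split_delimited("/LOCATE ME/CHANGED ME/ * *"))
print(split_delimited(";CHANGED ME;CHANGED ME AGAIN; 1 *"))
```

This would misbehave if the trailing arguments themselves contained the delimiter character, so it is only a starting point.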
To avoid writing my own lexer, I tried shlex.
This works very well, except for CHANGE commands:
import shlex

def shlex_test(cmd_str):
    lex = shlex.shlex(cmd_str)
    lex.quotes = '/'
    return list(lex)

print(shlex_test('L /spaced string/'))
Does anyone know an easy way to accomplish this (using shlex or otherwise)?
EDIT:
If this helps, here is the CHANGE syntax specified in the help file:
'''
C [/stg1/stg2/ [n|nm]]

The CHANGE command replaces the m-th occurrence of "stg1" with "stg2" for the next n lines. The default value for m and n is 1.
'''
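One way to handle this syntax without shlex is a regular expression with a backreference, so that whatever character follows "C " is treated as the delimiter. The sketch below is only an illustration under assumptions: the names `CHANGE_RE` and `parse_change` are hypothetical, the delimiter is assumed to be any single non-blank character, and the trailing counts (including `*`) are returned as raw tokens because the help text does not say how they should be interpreted.

```python
import re

# Hypothetical pattern for: C <d>stg1<d>stg2<d> [counts]
CHANGE_RE = re.compile(
    r"""^C\s+
        (?P<d>\S)            # delimiter: any single non-blank character
        (?P<old>.*?)(?P=d)   # stg1, up to the next delimiter
        (?P<new>.*?)(?P=d)   # stg2, up to the next delimiter
        \s*(?P<counts>.*)$   # trailing counts, left uninterpreted
    """,
    re.VERBOSE,
)


def parse_change(line):
    """Return ('C', stg1, stg2, [count tokens]) or None if the line does not match."""
    line = line.strip()
    if line == "C":
        # Bare "C": the whole argument list is optional in the help text.
        return ("C", None, None, [])
    m = CHANGE_RE.match(line)
    if m is None:
        return None
    return ("C", m.group("old"), m.group("new"), m.group("counts").split())


print(parse_change("C /LOCATE ME/CHANGED ME/ * *"))
print(parse_change("C ;CHANGED ME;CHANGED ME AGAIN; 1 *"))
```

The `(?P=d)` backreference is what lets the same pattern accept `/`, `;`, or any other separator.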
It is similarly difficult to tokenize the X and Y commands:
'''
X [/command/[command/[...]]n]
Y [/command/[command/[...]]n]

The X and Y commands allow the execution of several commands contained in one command. To define an X or Y "command string", enter X (or Y) followed by a space, then individual commands, each separated by a delimiter (eg a period "."). An unlimited number of commands may be placed in the X or Y command string. Once the command string has been defined, entering X (or Y) followed optionally by a count n will execute the defined command string n times. If n is not specified, it will default to 1.
'''
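A rough sketch of how an X or Y line might be split, based only on the help text above. Several things here are assumptions: the function name `parse_command_string` is hypothetical, an argument consisting only of digits is taken to be the count n rather than a definition, and the outer delimiter is assumed not to appear inside the individual commands. The sample input is made up for illustration.

```python
def parse_command_string(line):
    """Parse 'X <d>cmd<d>cmd<d>[n]' into (name, [cmd, ...], n)."""
    name, _, rest = line.strip().partition(" ")
    if name not in ("X", "Y"):
        raise ValueError("not an X or Y command: %r" % line)
    rest = rest.strip()
    if not rest:
        return name, [], 1            # bare "X": run the stored string once
    if rest.isdigit():
        return name, [], int(rest)    # "X 3": run the stored string 3 times
    delim = rest[0]                   # first character is the delimiter
    parts = rest[1:].split(delim)
    tail = parts.pop().strip()        # text after the last delimiter
    if tail and not tail.isdigit():
        raise ValueError("missing closing delimiter in %r" % line)
    count = int(tail) if tail else 1
    return name, parts, count


# Made-up example: three commands separated by "." and run twice.
print(parse_command_string("X .T.L /LOCATE ME/.C /LOCATE ME/CHANGED ME/. 2"))
```

Each command in the returned list could then be passed to the ordinary per-command tokenizer.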