The most efficient way to parse this scripting language

Question

I am implementing an interpreter for the scripting language of a long-outdated text editor, and I am having trouble getting the lexer to work correctly.

Here is an example of the problematic part of the language:

T L /LOCATE ME/ C /LOCATE ME/CHANGED ME/ * * C ;CHANGED ME;CHANGED ME AGAIN; 1 *

The / characters seem to mark strings, and they also act as the separator for the C (CHANGE) command in a sed-like syntax, although the language lets you use any character as the separator.

I have implemented about half of the most common commands so far, just using parse_tokens(line.split()). It was quick and dirty, but it worked surprisingly well.

To avoid writing my own lexer, I tried shlex.

It works very well, except for the CHANGE command:

    import shlex

    def shlex_test(cmd_str):
        lex = shlex.shlex(cmd_str)
        lex.quotes = '/'
        return list(lex)

    print(shlex_test('L /spaced string/'))
    # OK! gives: ['L', '/spaced string/']

    print(shlex_test('C /spaced string/another string/ * *'))
    # gives  : ['C', '/spaced string/', 'another', 'string/', '*', '*']
    # desired: any format that doesn't split on a space between /'s

    print(shlex_test('C ;b a;a b;'))
    # gives  : ['C', ';', 'b', 'a', ';', 'a', 'b', ';']
    # desired: same format as CHANGE command above

Does anyone know an easy way to accomplish this (using shlex or otherwise)?

EDIT:

If this helps, here is the CHANGE syntax specified in the help file:

    '''
    C [/stg1/stg2/ [n|nm]]

    The CHANGE command replaces the m-th occurrence of "stg1" with "stg2"
    for the next n lines. The default value for m and n is 1.'''

It is similarly difficult to tokenize the X and Y commands:

    '''
    X [/command/[command/[...]]n]
    Y [/command/[command/[...]]n]

    The X and Y commands allow the execution of several commands contained
    in one command. To define an X or Y "command string", enter X (or Y)
    followed by a space, then individual commands, each separated by a
    delimiter (eg a period "."). An unlimited number of commands may be
    placed in the X or Y command string. Once the command string has been
    defined, entering X (or Y) followed optionally by a count n will execute
    the defined command string n times. If n is not specified, it will
    default to 1.'''

python lexer shlex

Robbie rosati Jul 19 '12 at 16:51

1 answer

sevenforce · Answer 1 · 2012-07-19T21:19:08+0000

Perhaps the problem is that / is not being used as a quote character here, but only as a delimiter. I assume that the third character of the command always defines the delimiter. Also, you do not need the / or ; in the output, do you?

For the L and C commands, I got the following using nothing but split:

    >>> def parse(cmd):
    ...     delim = cmd[2]
    ...     return cmd.split(delim)
    ...
    >>> c_cmd = "C /LOCATE ME/CHANGED ME/ * *"
    >>> parse(c_cmd)
    ['C ', 'LOCATE ME', 'CHANGED ME', ' * *']
    >>> c_cmd2 = "C ;a b;b a;"
    >>> parse(c_cmd2)
    ['C ', 'a b', 'b a', '']
    >>> l_cmd = "L /spaced string/"
    >>> parse(l_cmd)
    ['L ', 'spaced string', '']

For the optional " * *" part, you can call split(" ") on the last list item.

    >>> parse(c_cmd)[-1].split(" ")
    ['', '*', '*']
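
The two steps can probably be folded into one helper. What follows is only a rough sketch under a few assumptions that the help file does not confirm: the command name is a single word followed by a space, the first non-space character after it is the delimiter, and a digit is never used as a delimiter (so that a bare count such as "X 3" is read as an argument). The name parse_command is made up for illustration.

```python
def parse_command(cmd):
    """Split one command into (name, fields, args).

    Hypothetical helper. Assumes the first non-space character after the
    command name is the delimiter, and that digits are never delimiters.
    """
    name, _, rest = cmd.strip().partition(" ")
    rest = rest.lstrip()
    if not rest or rest[0].isdigit():
        # Bare command ("C") or command with only a count ("X 3").
        return name, [], rest.split()
    delim = rest[0]
    # Everything after the last delimiter is the trailing argument text.
    *fields, tail = rest[1:].split(delim)
    return name, fields, tail.split()


print(parse_command("C /LOCATE ME/CHANGED ME/ * *"))
print(parse_command("C ;a b;b a;"))
print(parse_command("L /spaced string/"))
print(parse_command("X 3"))
```

Because the trailing part is split with split() rather than split(" "), the leading empty string disappears. This only handles one command per call; a line that chains several commands would first need to be cut into individual commands, which requires knowing how many delimited strings each command takes.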
