Better solution than lex / yacc for parsing DSL in C?

One of my programs accepts commands (e.g. kill foo ) at runtime. Think of it as a small domain language. Here are some examples:

 kill kill client exit 

But chained teams are also allowed, and spaces are not significant before and after the teams, so the following examples are true:

 kill ; say "that was fun" kill ; kill ; kill; 

I have currently implemented this with lex / yacc (flex / bison) and this has caused a lot of headache. The lexer is very context sensitive (for example, tokens in the form of spaces usually do not return unless after the kill keyword) and has many different states. The grammar had conflicts, and I don't really like the format in which it should be specified (especially for $ 1, $ 2, $ 3, ... to use arguments for non-terminals). In addition, the error messages that the bison provides (during parsing) are sometimes accurate but often not used (the kill command with additional arguments leads to error messages, for example Unexpected $undefined, expected $end or ; for kill clont instead of kill client ). Finally, the C API for yacc is cruel (external defines around the place).

I do not ask you to solve all the above issues (I will open separate streams with more specific descriptions and code if there is no lex / yacc path). Instead, I'm interested in alternatives to lex / yacc.

My criteria are as follows:

  • Input is a string (const char *), there is no output, but instead you need to call the code for every other keyword.
  • I want to use this with C (C99).
  • The software should already be included in major Linux distributions, or at least easy to bundle / package.
  • It should be well documented.
  • The syntax for describing my language should be easy.
  • It should output meaningful error messages when analyzing errors.
  • Performance is not so important (of course, it should be fast, but a typical use case is interactive use, rather than processing tons of MB instructions).
+4
source share
6 answers

As for the very simple and small grammar, I would think of writing lexer / parser manually - often this is not so much.

Almost all linux distributions provide the lex / yacc option. In addition, two other commonly used parser generators are lemon and antlr .

+3
source

Since your language looks pretty simple, I suggest implementing a state machine that tokens and parses input.

Just read the input of one character at a time, specifying a space (not yet in the quotation mark). Each β€œcommand” transfers the machine to a different state, where it analyzes the arguments of the command. ";" or "\ n" reset the device to its original state.

+3
source

I like ANTLR a couple of times in production systems.

Something bizarre is that in version 2 it supported the generation of C ++ code, but not C, while in version 3 it supported the creation of C code, but not C ++. I like C ++ and still use ANTLR v2, but you'll probably like v3. So much the better for you.

Many distributions have ANTLR v2 packages, and some also have v3. It is fairly well documented (note that I am using v2, I hope v3 is no worse in this regard).

ANTLR does not generate state-of-the-art parsing error messages out of the box. This, apparently, is something common to most common parser systems, and in principle this is not an easy task. However, with some work, I saw some good diagnostic information coming from an ANTLR-based system (the application had some logic to help figure out what to say to the user - there is not much ANTLR here).

+1
source

An interesting alternative to Lex and Yacc is the Lemon parser. This is quite a lot for this, but I have not used it seriously, so I'm not quite sure how much it really works. It is used by SQLite.

+1
source

You might want to consider Ragel . I recently started using it and enjoyed working with you as soon as you get up to speed. In your example, you can do something like (note: not verified!):

 #include <stdio.h> #include <string.h> %%{ machine my_cmd_lang; action pk { printf("Killing %.*s\n", fpc-mark, mark); } action mk { mark = fpc; } k = 'kill'; # creates a machine that doesn't do anything x = 'exit' @{ printf("Exiting\n"); }; arg = alpha+ >mk; # arg to kill is built in machine 'alpha' 1 or more times cmd = ((k space arg) @pk space* ';'?) | x; main := cmd* ; }%% %% write data; int main(int argc, char* argv[]) { int cs; char* p = "kill client"; char* pe = p + strlen(p); char* mark; %% write init; %% write exec; return 0; } 

Run it through Ragel with ragel <filename.rl> and it will spit out <filename.c> .

+1
source

You will need a parser without lexing (for example, a PEG implementation). Since you are using C and already familiar with yacc, something like this might be worth a try.

And if your grammar is simple enough, you can implement a parser with recursive ad hoc descents instead.

0
source

All Articles