Finite state machine parser

I would like to parse a self-designed file format with an FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult project :)). I have a tokenized string in which newlines signify the end of a line. See here for an example input. All comments will be filtered out, leaving me with a std::string like this:

 global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ... 

Syntax Explanation:

  • {} denote scopes, and the keywords indicate that a list of options/files follows.
  • \n is significant only inside a list of options/files, signifying the end of the list.

Therefore, I thought an FSM would be simple and elegant enough for my needs/knowledge. As far as I can tell (and as far as I want my file format design to go), I don't need parallel states or anything like that. Some design/implementation questions:

  • Should I use an enum or an abstract class + derivatives for my states? The first is probably better suited for a small syntax, but could become ugly later on, and the second is the exact opposite. I lean towards the first, for its simplicity (a rough sketch follows below). enum example and class example. EDIT: what about this suggestion for goto? I thought those were considered evil in C++?
  • When reading a list, I need NOT to ignore \n. My preferred way of reading a string through a stringstream skips \n by default. So I need a simple way of telling the (same!) stringstream not to skip newlines when a certain state is active.
  • Will simple enum states be sufficient for multi-level parsing (scope within a scope {...{...}...}), or would this require hacky implementations?
  • This is a draft of the states I have in mind:
    • upper: reads global, exe, lib + target names ...
    • normal: inside a scope, can read SOURCES ..., create user variables ...
    • list: adds items to a list until a newline is encountered.

Each scope will have a kind of conditional (for example, win32:global { gcc:CFLAGS = ... }) and it will need to be handled in exactly the same way everywhere (even in the list state, for each item).
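To make the enum option concrete, something like this minimal sketch is what I have in mind (placeholder state names; note that the newline problem from the second bullet already bites here, since operator>> swallows \n):

    #include <iostream>
    #include <sstream>
    #include <string>

    // Placeholder states from the draft above.
    enum class State { Upper, Normal, List };

    int main() {
        std::istringstream in("global { SOURCE_DIRS src \n SOURCES bitwise.c framing.c \n }");
        State state = State::Upper;
        std::string tok;
        while (in >> tok) {                // NB: operator>> also skips '\n' --
            switch (state) {               // exactly the problem described above
            case State::Upper:
                if (tok == "{") state = State::Normal;   // entered a scope
                break;
            case State::Normal:
                if (tok == "}") state = State::Upper;    // left the scope
                else            state = State::List;     // a keyword opens a list
                break;
            case State::List:
                std::cout << "list item: " << tok << '\n';
                // a newline should return us to Normal here,
                // but operator>> never lets us see it
                break;
            }
        }
    }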

Thanks for any input.

+4
3 answers

If you have nested scopes, then a finite state machine is not the right way to go, and you should look at a context-free grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
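For example, the recursive-functions approach might be sketched like this (a minimal, hypothetical sketch; it assumes the input has already been tokenized, with "{", "}" and "\n" as tokens):

    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Recursive-descent sketch: one function per grammar rule.
    struct Parser {
        std::vector<std::string> toks;
        std::size_t pos = 0;

        const std::string& peek() const { return toks.at(pos); }
        void expect(const std::string& t) {
            if (toks.at(pos++) != t) throw std::runtime_error("expected " + t);
        }

        // scope ::= name "{" entry* "}"
        void scope() {
            ++pos;                                  // scope name, e.g. "global"
            expect("{");
            while (peek() != "}") entry();
            expect("}");
        }

        // entry ::= scope | keyword item* "\n"
        void entry() {
            if (toks.at(pos + 1) == "{") { scope(); return; }  // nesting = recursion
            ++pos;                                  // keyword, e.g. "SOURCES"
            while (peek() != "\n") ++pos;           // list items
            expect("\n");
        }
    };

    int main() {
        Parser p{ { "global", "{", "SOURCES", "bitwise.c", "framing.c", "\n", "}" } };
        p.scope();   // throws std::runtime_error on malformed input
    }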

If you add a stack to the FSM, you get into pushdown automaton territory (a deterministic pushdown automaton, which is strictly less powerful than a nondeterministic one). LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In that sense, adding a stack is not "hacky".) This Wikipedia article also describes how to build an LR(1) automaton from your grammar, but IMO the article is not as clear as it could be.

If your scopes only nest to a finite depth (i.e. you have the upper, normal and list levels, but no nested lists or nested normals), then you can use an FSM without a stack.

+12

There are two stages to parsing a textual input stream:

Lexical analysis: this is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (loosely analogous to words in a spoken or written language). Finite state machines are very good at lexical analysis, provided you have made good design decisions about the lexical structure. From your data above, the individual tokens would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbol characters (e.g. "{", "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.

Syntactic/grammar analysis: having produced a sequence of tokens (or perhaps while you are producing them), you need to be able to analyze the structure to determine whether the token sequence is consistent with your language design. To do this, you generally need some kind of parser, although if the language structure is not very complicated you may be able to do it with a state machine instead. In general (and since you want nested structures in your case in particular), you will need to use one of the techniques that Ken Bloom describes.

So, in answer to your questions:

Should I use an enum or abstract class + derivatives for my states?

I have found that for small tokenizers a matrix of state/transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. Here, next_state and current_state are some integer type (possibly an enumerated type). Input errors are detected by a transition to an invalid state. The end of a token is identified when a final (accepting) state has no valid transition to another state for the next input character. If space is a concern, you could use a vector of maps instead. Making the states classes is possible, but I think that is probably harder than you need.
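A minimal sketch of the matrix idea (hypothetical states and character classes, just enough to tokenize identifiers and newline tokens; a real tokenizer would need more of both):

    #include <cctype>
    #include <iostream>
    #include <string>

    // Hypothetical tokenizer states; Err marks the invalid state.
    enum State { Start, Ident, Newline, Err, NUM_STATES };
    enum CharClass { Alpha, NL, Blank, Other, NUM_CLASSES };

    CharClass classify(char c) {
        if (c == '\n') return NL;
        if (c == ' ' || c == '\t') return Blank;
        if (std::isalnum(static_cast<unsigned char>(c)) ||
            c == '_' || c == '.' || c == '/') return Alpha;  // identifier chars
        return Other;  // anything else is an input error in this sketch
    }

    // next_state = state_transitions[current_state][class of current input char]
    const State state_transitions[NUM_STATES][NUM_CLASSES] = {
        /* Start   */ { Ident, Newline, Start, Err },
        /* Ident   */ { Ident, Err,     Err,   Err },  // no transition => token ends
        /* Newline */ { Err,   Err,     Err,   Err },  // '\n' is a one-char token
        /* Err     */ { Err,   Err,     Err,   Err },
    };

    int main() {
        const std::string input = "SOURCES bitwise.c framing.c\n";
        State st = Start;
        std::string tok;
        for (char c : input) {
            State next = state_transitions[st][classify(c)];
            if (next == Err) {                    // no valid transition: emit token
                if (st == Ident)        std::cout << "token: " << tok << '\n';
                else if (st == Newline) std::cout << "token: \\n" << '\n';
                tok.clear();
                next = state_transitions[Start][classify(c)];  // restart on this char
            }
            st = next;
            if (st == Ident) tok += c;
        }
        if (st == Ident)        std::cout << "token: " << tok << '\n';  // flush last
        else if (st == Newline) std::cout << "token: \\n" << '\n';
    }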

When reading the list, I need NOT to ignore \n.

You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you are talking about identifying literal line breaks in the source, then those are just characters you need to create transitions for in your state transition matrix (beware of the difference between Unix and Windows line breaks, however; you can create an FSM that operates on either).
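On the stringstream point specifically, one approach (just a sketch, with a hypothetical next_token helper) is to skip only spaces and tabs by hand, so that '\n' stays visible as its own token on the same stream:

    #include <cstdio>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Sketch: read word tokens from a stream but surface '\n' explicitly.
    // Skips only spaces/tabs, so a newline can end a list.
    bool next_token(std::istream& in, std::string& tok) {
        while (in.peek() == ' ' || in.peek() == '\t') in.get();
        if (in.peek() == EOF) return false;
        if (in.peek() == '\n') { in.get(); tok = "\n"; return true; }
        tok.clear();
        int c;
        while ((c = in.peek()) != EOF && c != ' ' && c != '\t' && c != '\n')
            tok += static_cast<char>(in.get());
        return true;
    }

    int main() {
        std::istringstream in("SOURCES bitwise.c framing.c\nHEADERS ogg/ogg.h\n");
        std::string tok;
        while (next_token(in, tok))
            std::cout << (tok == "\n" ? "<end-of-list>" : tok) << '\n';
    }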

Will simple enum states be sufficient for multi-level parsing (scope within a scope {...{...}...}), or will this require hacky implementations?

This is where you will need a grammar or a pushdown automaton, unless you can guarantee that the nesting will not exceed a certain depth. Even then, it will likely make your FSM very complex.
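To illustrate the "FSM plus a stack" idea from the first answer, a minimal hypothetical sketch that tracks only scope nesting (the token stream is made up; "\n" is written as the literal "\\n" for readable output):

    #include <iostream>
    #include <stack>
    #include <string>
    #include <vector>

    enum class State { Upper, Normal, List };

    int main() {
        // Hypothetical pre-tokenized input with a nested scope.
        const std::vector<std::string> toks =
            { "global", "{", "gcc", "{", "CFLAGS", "-O2", "\\n", "}", "}" };

        // The explicit stack is exactly what turns the FSM into a
        // (deterministic) pushdown automaton and makes nesting tractable.
        std::stack<State> saved;
        State st = State::Upper;
        for (const std::string& t : toks) {
            if (t == "{") { saved.push(st); st = State::Normal; }    // enter scope
            else if (t == "}") {
                if (saved.empty()) { std::cerr << "unbalanced }\n"; return 1; }
                st = saved.top(); saved.pop();                        // leave scope
            }
            std::cout << "token '" << t << "' at depth " << saved.size() << '\n';
        }
        if (!saved.empty()) std::cerr << "unclosed scope\n";
    }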

This is a draft of the states I have in mind: ...

See my comments on lexical and grammar analysis above.

+3

For parsing, I always try to use something already proven to work: ANTLR with ANTLRWorks, which is a great help for designing and testing a grammar. You can generate code for C/C++ (and other languages), but you need to build the ANTLR runtime for those languages.

Of course, if you find flex or bison easier to use, you can use them too (I know that they only generate C and C ++, but I could be wrong since I haven't used them for a while).

+1

Source: https://habr.com/ru/post/1313391/

