Parser with scopes and conditionals

I am writing a C/C++/... build system (I understand this is crazy ;)), and I have run into problems designing my parser.

My “recipes” are as follows:

    global
    {
        SOURCE_DIRS src
        HEADER_DIRS include
        SOURCES bitwise.c \
                framing.c
        HEADERS \
                ogg/os_types.h \
                ogg/ogg.h
    }

    lib static ogg_static
    {
        NAME ogg
    }

    lib shared ogg_shared
    {
        NAME ogg
    }

(This is based on the super simple libogg source tree)

# starts a comment, and \ is a line-continuation character, i.e. the statement continues on the next line (compare the QMake syntax). {} delimits scopes as in C++, and global is a section whose settings apply to all targets. That is all just background, and not so important... What I really don't know is how to handle my scopes. I will need multiple scopes, as well as a form of conditional processing on individual lines:

 win32:DEFINES NO_CRT_SECURE_DEPRECATE 

The parsing function needs to know what scope level it is currently in, and it should call itself whenever the scope deepens. There is also the problem of brace placement (global { or global{ or braces on their own lines, as in the example).

How can I do this using standard C++ and the STL? I understand this is a fair amount of work, which is why I need a good starting point. Thanks!

What I already have is the whole file in a stream with internal string/stringstream storage, so I can read it word by word.

4 answers

I would suggest (and this is more or less the standard advice from compiler textbooks) that you approach the problem in phases. Breaking it down this way makes the problem at each stage much more manageable.

Focus on the lexer phase first. Your lexer should take the source text and give you a sequence of tokens, such as words and special characters. The lexer can take care of line continuations and strip whitespace and comments as needed. By handling whitespace, the lexer simplifies the parser's task: you can write the lexer so that global{ , global { and even

global
{

will all produce the same two tokens: one representing global and one representing { .

Also note that the lexer can attach line and column numbers to each token, for later use when you report errors.
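
As a concrete starting point, here is a minimal hand-written lexer sketch in standard C++ for your recipe format. All the names are invented for illustration, and it takes one liberty with the description above: newlines are kept as tokens (and a trailing \ suppresses them) so that a later parser can tell where a line-oriented setting ends. All other whitespace is skipped.

    #include <cctype>
    #include <string>
    #include <vector>

    // Hypothetical token model; the names are illustrative only.
    enum class TokType { Word, LBrace, RBrace, Colon, Newline, EndOfInput };

    struct Token {
        TokType type;
        std::string text;
        int line, col;          // kept so the parser can report good errors
    };

    std::vector<Token> lex(const std::string& src) {
        std::vector<Token> out;
        int line = 1, col = 1;
        size_t i = 0;
        auto step = [&] {       // consume one character, tracking position
            if (src[i] == '\n') { ++line; col = 1; } else { ++col; }
            ++i;
        };
        while (i < src.size()) {
            char c = src[i];
            if (c == '#') {                          // comment runs to end of line
                while (i < src.size() && src[i] != '\n') step();
            } else if (c == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
                step(); step();                      // continuation: drop '\' and newline
            } else if (c == '\n') {
                out.push_back({TokType::Newline, "\\n", line, col});
                step();
            } else if (std::isspace(static_cast<unsigned char>(c))) {
                step();                              // other whitespace is insignificant
            } else if (c == '{' || c == '}' || c == ':') {
                TokType t = (c == '{') ? TokType::LBrace
                          : (c == '}') ? TokType::RBrace : TokType::Colon;
                out.push_back({t, std::string(1, c), line, col});
                step();
            } else {                                 // anything else is a word or a path
                int wl = line, wc = col;
                std::string word;
                while (i < src.size()) {
                    char d = src[i];
                    if (std::isspace(static_cast<unsigned char>(d)) ||
                        d == '{' || d == '}' || d == ':' || d == '#')
                        break;
                    if (d == '\\' && i + 1 < src.size() && src[i + 1] == '\n')
                        break;                       // continuation glued to a word
                    word += d;
                    step();
                }
                out.push_back({TokType::Word, word, wl, wc});
            }
        }
        out.push_back({TokType::EndOfInput, "", line, col});
        return out;
    }

With this, lex("global{") and lex("global {") both yield Word("global") followed by LBrace, which is exactly the property described above.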

Once you have a good stream of tokens, work on your parsing phase. The parser should take that sequence of tokens and build an abstract syntax tree that models the syntactic structures of your document. At this point you should no longer be worrying about ifstream and operator>> , since the lexer has done all the reading for you.
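
For concreteness, here is one hypothetical shape such a tree could take for the recipe format (the struct and field names are invented for illustration):

    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical AST shapes for the recipe language.
    struct Setting {
        std::string condition;            // e.g. "win32"; empty if unconditional
        std::string key;                  // e.g. "DEFINES"
        std::vector<std::string> values;  // e.g. {"NO_CRT_SECURE_DEPRECATE"}
    };

    struct Section {
        std::vector<std::string> header;  // e.g. {"lib", "static", "ogg_static"}
        std::vector<Setting> settings;
        std::vector<std::unique_ptr<Section>> children;  // nested scopes, if you allow them
    };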

You mentioned an interest in having the parsing function call itself recursively whenever it sees a scope. That is definitely one way to go. As you will see, a design decision you will have to make over and over is whether to literally invoke the same parsing function recursively (for constructs such as global { global { ... } } , which you may well want to reject syntactically) or to define a slightly (or even substantially) different set of syntax rules that apply inside a scope.

Once you find that you need different rules, the key is to factor into functions as much as can be reused between the different syntax variants. If you keep moving in this direction, using separate functions that represent the different pieces of syntax you want to handle and having them call each other (possibly recursively) where necessary, you will end up with what is called a recursive descent parser. The Wikipedia entry has a simple example: see http://en.wikipedia.org/wiki/Recursive_descent_parser . A sketch for the recipe format follows.
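
Continuing the earlier sketches (and reusing their hypothetical Token/TokType and Setting/Section types), a recursive descent parser for the recipe format could look roughly like this. The lookahead in nestedSectionAhead is where you would enforce whether nested scopes are allowed at all:

    #include <memory>
    #include <stdexcept>
    #include <string>
    #include <vector>

    struct Parser {
        const std::vector<Token>& toks;
        size_t pos = 0;

        const Token& peek() const { return toks[pos]; }
        void skipNewlines() { while (peek().type == TokType::Newline) ++pos; }

        Token expect(TokType t) {
            if (peek().type != t)
                throw std::runtime_error("parse error near '" + peek().text +
                                         "' at line " + std::to_string(peek().line));
            return toks[pos++];
        }

        // Lookahead: do the upcoming words form the header of a nested scope
        // (words followed by '{')? Whether nesting is legal at all is exactly
        // the design decision discussed above.
        bool nestedSectionAhead() const {
            size_t p = pos;
            while (toks[p].type == TokType::Word) ++p;
            while (toks[p].type == TokType::Newline) ++p;
            return toks[p].type == TokType::LBrace;
        }

        // file := section*
        std::vector<Section> parseFile() {
            std::vector<Section> file;
            skipNewlines();
            while (peek().type != TokType::EndOfInput) {
                file.push_back(parseSection());
                skipNewlines();
            }
            return file;
        }

        // section := word+ '{' (setting | section)* '}'
        Section parseSection() {
            Section sec;
            while (peek().type == TokType::Word)
                sec.header.push_back(expect(TokType::Word).text);
            skipNewlines();                  // tolerates "global" NEWLINE "{"
            expect(TokType::LBrace);
            skipNewlines();
            while (peek().type != TokType::RBrace) {
                if (nestedSectionAhead())
                    sec.children.push_back(std::make_unique<Section>(parseSection()));
                else
                    sec.settings.push_back(parseSetting());
                skipNewlines();
            }
            expect(TokType::RBrace);
            return sec;
        }

        // setting := (word ':')? word word*   e.g. "win32:DEFINES NO_CRT_SECURE_DEPRECATE"
        Setting parseSetting() {
            Setting s;
            s.key = expect(TokType::Word).text;
            if (peek().type == TokType::Colon) {   // conditional prefix like "win32:"
                ++pos;
                s.condition = s.key;
                s.key = expect(TokType::Word).text;
            }
            while (peek().type == TokType::Word)
                s.values.push_back(expect(TokType::Word).text);
            return s;
        }
    };

Typical usage would be: auto toks = lex(src); Parser p{toks}; auto ast = p.parseFile(); with errors surfacing as exceptions that carry the offending token's line number.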

If you really want to dig into the theory and practice of lexers and parsers, I recommend getting hold of a good, solid compiler textbook. The Stack Overflow question mentioned in the comments above will get you started: Learning how to write a compiler


boost::spirit is a good recursive descent parser generator that uses C++ templates as a language extension for describing the parser and lexer. It works well with native C++ compilers, but will not compile under Managed C++.

CodeProject has a tutorial article that may help.
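
For a taste of what Spirit looks like, here is a minimal sketch using the Qi flavour of the library (assuming a reasonably recent Boost; the tiny grammar is invented for illustration and parses only a single setting line into a key and its values):

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string input = "DEFINES NO_CRT_SECURE_DEPRECATE OTHER_FLAG";

        // A "word" is one or more identifier/path characters.
        qi::rule<std::string::const_iterator, std::string(), qi::space_type> word =
            qi::lexeme[+(qi::alnum | qi::char_("_./"))];

        std::string key;
        std::vector<std::string> values;

        auto first = input.cbegin(), last = input.cend();
        bool ok = qi::phrase_parse(first, last,
                                   word >> +word,   // key followed by 1+ values
                                   qi::space, key, values)
                  && first == last;

        if (ok)
            std::cout << key << " has " << values.size() << " value(s)\n";
    }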


ANTLR (use ANTLRWorks ), and after that you can look at flex/bison and others such as Lemon. There are plenty of tools, but ANTLR and flex/bison should be enough. Personally, I like ANTLRWorks too much to recommend anything else.

LATER: With ANTLR you can generate parser/lexer code for a variety of target languages .
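
For example, with ANTLR's C++ target the driver code ends up looking roughly like this (a sketch assuming ANTLR 4, a hypothetical Recipe.g4 grammar with a top-level rule named file, and the RecipeLexer / RecipeParser classes ANTLR would generate from it):

    #include "antlr4-runtime.h"   // ANTLR 4 C++ runtime
    #include "RecipeLexer.h"      // generated from the hypothetical Recipe.g4
    #include "RecipeParser.h"

    #include <iostream>

    int main() {
        antlr4::ANTLRInputStream input("global { SOURCE_DIRS src }");
        RecipeLexer lexer(&input);
        antlr4::CommonTokenStream tokens(&lexer);
        RecipeParser parser(&tokens);

        // "file" is the (hypothetical) top-level grammar rule.
        antlr4::tree::ParseTree* tree = parser.file();
        std::cout << tree->toStringTree(&parser) << std::endl;
    }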


Unless the point of the project is specifically to learn how to write a lexer and parser, I recommend using Flex and Bison, which will handle most of the parsing work for you. Writing the grammar and the semantic analysis will still be plenty of work, don't worry ;)


Source: https://habr.com/ru/post/1313395/

