Writing an extremely simple parser

I am writing a very simple web server that needs to support an extremely limited server-side scripting language. All I need to support is an echo statement; adding/subtracting/multiplying (no dividing) with exactly two operands; a simple date() function that prints the date; and the "&" operator for string concatenation.

An example would be:

echo "Here is the date: " & date(); echo "9 x 15 = " & 9*15;

I went ahead and wrote the code that produces the tokens, but I'm not sure I chose the right set of tokens.

I created tokens for the following:

ECHO - The echo command
WHITESPACE - Any whitespace
STRING - A string inside quotation marks
DATE - The date() function
CONCAT - The & operator for concatenation
MATH - Any single binary operation (5+4, 9*2, 8-2, etc.)
TERM - The terminator character (;)

MATH is the one I'm unsure about. Usually I see people create a token for integers and a token for each operator, but since I ONLY want to allow binary operations, I thought it made sense to group the whole operation into one token. If I tokenized everything separately, I would have to do extra work to make sure I never accept "5 + 4 + 1".

So, question 1: am I on the right track with the tokens I chose?

My next question is: what should I do with these tokens to enforce correct syntax? The approach I was considering was basically to say: "OK, I know I have this token; here is the list of tokens allowed to follow the current one. Is the next token on that list?"

Based on this, I compiled a list of all my tokens along with the tokens that are valid immediately after each of them (ignoring whitespace for simplicity):

ECHO -> STRING | MATH | DATE
STRING -> TERM | CONCAT
MATH -> TERM | CONCAT
DATE -> TERM | CONCAT
CONCAT -> STRING | MATH | DATE
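A table like this can be encoded directly as a dictionary mapping each token type to the set of types allowed to follow it; here is a sketch in Python (the token names are the ones above, everything else is illustrative):

```python
# Hypothetical follow table: which token types may appear after each type.
FOLLOW = {
    "ECHO":   {"STRING", "MATH", "DATE"},
    "STRING": {"TERM", "CONCAT"},
    "MATH":   {"TERM", "CONCAT"},
    "DATE":   {"TERM", "CONCAT"},
    "CONCAT": {"STRING", "MATH", "DATE"},
    "TERM":   {"ECHO"},  # a new statement may start after ';'
}

def check_sequence(token_types):
    """Return True if each token is allowed to follow the previous one."""
    if not token_types or token_types[0] != "ECHO":
        return False
    for prev, cur in zip(token_types, token_types[1:]):
        if cur not in FOLLOW[prev]:
            return False
    return token_types[-1] == "TERM"  # every statement must end with ';'

print(check_sequence(["ECHO", "STRING", "CONCAT", "DATE", "TERM"]))  # True
print(check_sequence(["ECHO", "CONCAT", "TERM"]))                    # False
```

With the rules in data rather than in code, the "is the next token allowed?" check is a single set lookup instead of a pile of if blocks.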

The problem is that I'm not sure how best to implement this. In reality I do need to track whitespace to make sure there are spaces between tokens, but that means looking ahead two tokens at a time, which is starting to feel even more intimidating. I'm also not sure how to manage the valid-next-token rules without a disgusting pile of if blocks. Should I validate the syntax before trying to actually execute the script, or should I do it all in one pass and just throw an error when I reach an unexpected token? In this simple example everything will always parse cleanly from left to right; there are no real precedence rules (except within MATH, which is part of why I combined it into one token, even though that doesn't feel right). Despite that, I wouldn't mind developing a more scalable and elegant solution.

In my research on writing parsers, I see many references to writing accept() and expect() functions, but I can't find a clear description of what they are supposed to do or how they should work.

I think I just don't know how to implement all this, and then how to actually produce the resulting output string at the end of the day.

Am I heading in the right direction, and does anyone know of a resource that could help me understand how best to implement something this simple? I have to do this by hand and can't use a tool like ANTLR.

Thanks in advance for your help.

2 answers

The first thing you need to do is discard all whitespace (except whitespace inside string literals). That way, when you add tokens to the token list, you can be sure the list contains only valid tokens. For example, consider this statement:

 echo "Here is the date: " & date(); 

I would start tokenizing by picking out echo based on the whitespace after it (yes, the whitespace is needed here as a delimiter, but after that it serves no purpose). The tokenizer then encounters a double quote and keeps reading until it finds the closing double quote. Similarly, it creates separate tokens for &, date, and ().

My token list now contains the following tokens:

echo
"Here is the date: "
&
date
()
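A minimal sketch of such a tokenizer in Python (the patterns are illustrative, and date() is kept as a single token for simplicity):

```python
import re

# Illustrative token patterns; order matters (most specific first).
TOKEN_SPEC = [
    ("ECHO",   r"echo\b"),
    ("STRING", r'"[^"]*"'),   # read everything up to the closing quote
    ("DATE",   r"date\(\)"),
    ("CONCAT", r"&"),
    ("TERM",   r";"),
    ("SKIP",   r"\s+"),       # whitespace is consumed but never emitted
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "SKIP":
                    tokens.append(m.group(0))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens

print(tokenize('echo "Here is the date: " & date();'))
# ['echo', '"Here is the date: "', '&', 'date()', ';']
```

Because SKIP matches are dropped before they ever reach the token list, the list only ever contains meaningful tokens, exactly as described above.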

Now, in the parsing stage, the parser walks through each token in the token list. It reads echo and checks whether it is valid (based on the rules/functions you have defined for the language). It then advances to the next token and checks whether it is a date, a string, or math. It checks the remaining tokens the same way. If at some point a token shouldn't be there, you can throw an error indicating a syntax error or the like.

To tokenize math operators, keep a bracketed expression together as one token, and tokenize the other operands and operators separately. For example, 9/3 + (7-3 + 1) would produce the tokens 9, /, 3, + and (7-3 + 1). Since each token has its own precedence (which you define in your token structure), you can evaluate from the highest-precedence token down to the lowest. That way you can support precedence in expressions. If you are still confused, let me know and I will write you some example code.
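A sketch of that bracket-grouping idea in Python (illustrative only; the questioner's language doesn't actually need division or brackets):

```python
def tokenize_math(expr):
    """Split a math expression into numbers, operators, and bracketed groups."""
    tokens = []
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch.isspace():
            i += 1
        elif ch == "(":            # keep a whole bracketed group as one token
            depth, start = 1, i
            i += 1
            while i < len(expr) and depth:
                depth += {"(": 1, ")": -1}.get(expr[i], 0)
                i += 1
            tokens.append(expr[start:i])
        elif ch.isdigit():         # a run of digits is one number token
            start = i
            while i < len(expr) and expr[i].isdigit():
                i += 1
            tokens.append(expr[start:i])
        elif ch in "+-*/":
            tokens.append(ch)
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    return tokens

print(tokenize_math("9/3 + (7-3 + 1)"))
# ['9', '/', '3', '+', '(7-3 + 1)']
```

The bracketed token can then be fed back through the same tokenizer (minus the outer brackets) when it is its turn to be evaluated.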


expect is what your parser does to get the next token, failing if that token is not a valid next token. First of all, your parser expects ECHO or WHITESPACE; those are the only acceptable ways to start. Having seen ECHO, your parser then expects one of WHITESPACE | STRING | MATH | DATE; anything else is an error. And so on.

accept is when your parser has seen a complete "statement": ECHO, followed by a valid sequence of tokens, and then TERM. Your parser now has enough information to process your ECHO command.
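A common way to write these two helpers, sketched in Python (assuming a hypothetical Parser class that holds the token list as (type, value) pairs plus a position index):

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # list of (type, value) pairs
        self.pos = 0

    def peek(self):
        """Type of the next token, or None at end of input."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def accept(self, token_type):
        """Consume the next token if it matches; return its value or None."""
        if self.peek() == token_type:
            value = self.tokens[self.pos][1]
            self.pos += 1
            return value
        return None

    def expect(self, token_type):
        """Consume the next token; raise if it does not match."""
        value = self.accept(token_type)
        if value is None:
            raise SyntaxError(f"expected {token_type}, got {self.peek()}")
        return value
```

accept is the "try it, no harm done" form used when several tokens are legal next; expect is the "it must be this, or the input is broken" form.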

Oh, and hand-written parsers (especially simple ones) are very often disgusting collections of if blocks (or their moral equivalents, such as switch) :) Further along the elegance scale would be something generated from a grammar by a parser generator such as yacc or the GOLD Parser Generator (which, in turn, emits the ugly ifs, switches, and state machines for you).

EDIT to provide more details.

To keep the responsibilities separated, create a lexer whose job is to read the input and produce tokens. That includes deciding what constitutes a token. An easy token is the word echo. A less easy token is a math operation: that token consists of one or more digits, an operator, and one or more digits. The lexer takes care of skipping whitespace, as well as recognizing a quoted string and the characters that make up the date() function. The lexer returns two things: the type of the token it read and the token's value (for example, "MATH" and "9 * 15").
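A sketch of such a lexer in Python using one regex with named groups (the patterns are illustrative; note that MATH matches a whole two-operand operation, so "5 + 4 + 1" can never lex as a single MATH token):

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<ECHO>echo\b)
  | (?P<STRING>"[^"]*")
  | (?P<DATE>date\(\))
  | (?P<MATH>\d+\s*[-+*]\s*\d+)   # exactly two operands, one operator
  | (?P<CONCAT>&)
  | (?P<TERM>;)
  | (?P<SKIP>\s+)
""", re.VERBOSE)

def lex(source):
    """Yield (type, value) pairs; whitespace is consumed but not emitted.

    Caveat: finditer silently skips characters no pattern matches; a
    production lexer should detect such gaps and report them as errors.
    """
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group(0)

print(list(lex('echo "9 x 15 = " & 9*15;')))
```

Each yielded pair is exactly the (type, value) shape described above, ready for the parser to consume.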

With the lexer reading your input, the parser consumes the tokens and ensures they appear in the correct order. First you should see an ECHO token; if not, fail with an error message. After that you should see STRING, DATE, or MATH; if not, fail with an error message. Then you loop, expecting either TERM or CONCAT followed by another STRING, DATE, or MATH. If you see TERM, break out of the loop. If you see neither TERM nor CONCAT, fail with an error message.

You can execute the ECHO command while parsing, since this is such a simple grammar. Each time you find a STRING, DATE, or MATH token, evaluate it and concatenate it to what you already have. When you find TERM, exit the function and return the built-up string.
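Putting that loop together as a sketch in Python, evaluating as you parse (tokens are (type, value) pairs; datetime.date.today() stands in for date(), and the operator module handles MATH):

```python
import datetime
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_token(kind, value):
    """Evaluate a single STRING, DATE, or MATH token to a string."""
    if kind == "STRING":
        return value[1:-1]                     # strip the surrounding quotes
    if kind == "DATE":
        return str(datetime.date.today())      # stand-in for date()
    if kind == "MATH":
        left, op, right = re.match(r"(\d+)\s*([-+*])\s*(\d+)", value).groups()
        return str(OPS[op](int(left), int(right)))
    raise SyntaxError(f"expected STRING, DATE or MATH, got {kind}")

def parse_echo(tokens):
    """Parse one statement's tokens, building the output as we go."""
    pos = 0
    def next_token():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    kind, _ = next_token()
    if kind != "ECHO":
        raise SyntaxError("statement must start with echo")
    result = eval_token(*next_token())         # first STRING/DATE/MATH
    while True:                                # then (CONCAT item)* TERM
        kind, _ = next_token()
        if kind == "TERM":
            return result
        if kind != "CONCAT":
            raise SyntaxError("expected ';' or '&'")
        result += eval_token(*next_token())

print(parse_echo([("ECHO", "echo"), ("STRING", '"9 x 15 = "'),
                  ("CONCAT", "&"), ("MATH", "9*15"), ("TERM", ";")]))
# 9 x 15 = 135
```

Note how the loop mirrors the grammar directly: ECHO, one item, then any number of CONCAT-item pairs, closed by TERM.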

Questions? Comments? Omelets? :)

