Regex, writing a toy compiler, parsing, deleting comments

I am currently going through this book: http://www1.idc.ac.il/tecs/

Now I am in the section where the exercise is to create a compiler for a very simple Java-like language. The book always says what is required, but not how (or how well). I should also mention that the author talks about yacc and lex and specifically says to avoid them for the book's projects, so that you learn by doing it on your own.

I am on chapter 10, where I am starting to write a tokenizer.

1) Can someone give me general advice: are regular expressions the best approach for tokenizing the source file?

2) I want to remove comments from the source files before parsing. That is not difficult, but most compilers tell you which line an error occurs on, and if I just delete the comments, the line numbers will be thrown off. Are there any simple strategies for preserving line numbers while stripping out such garbage?

Thanks in advance!

+4
2 answers

The tokenizer itself is usually written using a large DFA table that describes all the possible valid tokens (for example: a token can begin with a letter, followed by more letters/digits, ended by a non-letter/non-digit; or it can begin with a digit, followed by more digits and then either a non-digit/non-dot, or a dot followed by at least one digit and then a non-digit, and so on). The way I built mine was to identify all the regular expressions my tokenizer would accept, convert each of them to a DFA, and combine them.
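A minimal sketch of that idea in Python (my own illustration, not from the book): rather than hand-building the combined DFA, you can give one regular expression per token kind and merge them into a single master pattern with named groups; the regex engine then does the state-machine work. All names here (`TOKEN_SPEC`, `tokenize`) are made up for the example.

```python
import re

# One regex per token kind, combined into a single alternation.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+(?:\.\d+)?"),       # digits, optional fraction
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("SKIP",       r"\s+"),                  # whitespace between tokens
    ("ERROR",      r"."),                    # any other character is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup            # name of the alternative that matched
        if kind == "SKIP":
            continue                  # whitespace is consumed, not emitted
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        tokens.append((kind, m.group()))
    return tokens
```

This trades the explicit table for the engine's internal automaton, which is often good enough for a toy compiler.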

Now, to "delete comments": while tokenizing you can have a comment token (the regular expression for matching a comment is too long to describe in words), and when you finish lexing that comment you simply throw it away and lex the next token. Alternatively, you can pass it on to the compiler and let it deal with it (or ignore it, as it likely will). Either approach preserves metadata such as line numbers and character positions within a line.
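For example, a hedged sketch (the token set and names are my own) of the "comment is just another token" approach: the comment is matched and then discarded, but since nothing is deleted from the source text, a line number computed by counting newlines stays correct for every token that follows.

```python
import re

TOKEN_RE = re.compile(r"""
      (?P<COMMENT>//[^\n]*|/\*.*?\*/)   # line or block comment
    | (?P<IDENT>[A-Za-z_]\w*)
    | (?P<NUMBER>\d+)
    | (?P<OP>[{}();=+*/-])
    | (?P<WS>\s+)
""", re.VERBOSE | re.DOTALL)            # DOTALL lets /* ... */ span lines

def tokenize(src):
    tokens, line = [], 1
    for m in TOKEN_RE.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind not in ("COMMENT", "WS"):
            tokens.append((kind, text, line))
        # comments and whitespace still advance the line counter,
        # so later tokens report the right source line
        line += text.count("\n")
    return tokens
```

Note that characters matching no alternative are silently skipped by `finditer`; a real lexer would add an error alternative as a catch-all.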

For DFA theory:

Every regular expression can be (and routinely is) converted to a DFA for performance reasons; this eliminates any backtracking while matching. First you convert the regex to an NFA (a non-deterministic finite automaton), then you remove the non-determinism via the subset construction, which can inflate your state machine.
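As an illustration of that second step, here is a small subset-construction sketch (my own; the NFA is encoded as a `(state, symbol) -> set of states` dict, with `""` as the epsilon symbol):

```python
from collections import deque

def eps_closure(nfa, states):
    """All states reachable from `states` via epsilon ("") moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, ""), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = eps_closure(nfa, {start})
    seen, work, dfa = {start_set}, deque([start_set]), {}
    while work:
        cur = work.popleft()
        for sym in alphabet:
            nxt = set()
            for s in cur:
                nxt |= nfa.get((s, sym), set())
            nxt = eps_closure(nfa, nxt)   # may be empty: the dead state
            dfa[(cur, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return start_set, dfa
```

A DFA state is accepting if its frozenset contains any accepting NFA state; that check is left out here for brevity.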

Another way to build your DFA is by hand, using common sense. Take, for example, a state machine that can parse either an identifier or a number. This is of course not sufficient, since you most likely also want to handle comments, but it will give you an idea of the basic structure.

          A-Z            space
 ->(Start)----->(I1)----------->((Identifier))
     |          | ^
     |          +-+
     |         A-Z0-9
     |            space
     +---->(N1)---+--->((Number))<----------+
      0-9  | ^    |                         |
           | |    |                         |
           +-+    | .      0-9       space  |
           0-9    +--->(N2)----->(N3)-------+
                                 | ^
                                 +-+
                                 0-9

Some notes on the notation used: the DFA starts at the (Start) node and moves along the arrows as input is read from your file. At any moment it can follow only one path. If no path matches, it falls into an implicit error node. ((Number)) and ((Identifier)) are your accepting (success) states. Once in those nodes, you return the token.

So, from the start: if your token begins with a letter, it must continue with a run of letters or digits and end with whitespace (space, newline, tab, etc.). There is no backtracking; if matching fails, the tokenization process fails and you can report an error. To continue parsing after an error you would need to read up on error-recovery theory, which is a really huge topic.

If, however, your token begins with a digit, it must be followed by either more digits or a single decimal point. If there is no decimal point, whitespace must follow the digits; otherwise a digit must follow the point, then a run of digits ending with whitespace. I did not include scientific notation, but it is not hard to add.

Now, for parsing speed, this is converted into a DFA table, with all the nodes listed both vertically and horizontally. Something like this:

              I1             Identifier   N1       N2       N3       Number
  start       letter         nothing      number   nothing  nothing  nothing
  I1          letter+number  space        nothing  nothing  nothing  nothing
  Identifier  nothing        SUCCESS      nothing  nothing  nothing  nothing
  N1          nothing        nothing      number   dot      nothing  space
  N2          nothing        nothing      nothing  nothing  number   nothing
  N3          nothing        nothing      nothing  nothing  number   space
  Number      nothing        nothing      nothing  nothing  nothing  SUCCESS

As you run this, you keep your current state and walk the table as you read the input character by character. For example, the input "1.2" is parsed as start -> N1 -> N2 -> N3 -> Number -> SUCCESS. If at any point you hit a "nothing" entry, you report an error.
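The table above can be driven by a short loop. This sketch is my own; it mirrors the diagram's states, and keys transitions as (state, character class) -> state, the orientation the answer's later edit recommends.

```python
def char_class(c):
    """Collapse a character to the classes used in the table."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "number"
    if c == ".":
        return "dot"
    if c.isspace():
        return "space"
    return "other"

TABLE = {
    ("start", "letter"): "I1",
    ("start", "number"): "N1",
    ("I1", "letter"): "I1",
    ("I1", "number"): "I1",
    ("I1", "space"): "Identifier",
    ("N1", "number"): "N1",
    ("N1", "dot"): "N2",
    ("N1", "space"): "Number",
    ("N2", "number"): "N3",
    ("N3", "number"): "N3",
    ("N3", "space"): "Number",
}
ACCEPT = {"Identifier", "Number"}

def classify(token):
    state = "start"
    for c in token + " ":              # a trailing space finishes the token
        state = TABLE.get((state, char_class(c)))  # missing entry = "nothing"
        if state is None:
            return "error"
        if state in ACCEPT:
            return state
    return "error"
```

Running `classify` on "1.2" follows exactly the start -> N1 -> N2 -> N3 -> Number path described above.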

edit 2: actually, the table should be node -> character -> node, not node -> node -> character, but it worked fine either way. It has been a while since I last wrote a compiler by hand.

+5

1 - A regex engine makes a good tokenizer. If you use a lex-generated tokenizer, you describe each token as a regular expression. See Mark's answer.

2 - The lexer is what usually tracks line/column information. As tokens are consumed by the tokenizer, you either attach line/column information to each token or keep it as current state. That way, when a problem is found, the tokenizer knows where you are. So when handling comments, the lexer simply increments line_count for each newline it processes.

In lex, you can also have lexing states (start conditions). Multi-line comments are often implemented using these states, which simplifies the regexes. Once you match the start of a comment, e.g. "/*", you switch into a comment state, which you can declare to be exclusive of the normal state. Then, while you consume text looking for the end-of-comment marker "*/", you do not match normal tokens.

This state-based processing is also useful for lexing string literals that allow escaped end markers inside them, e.g. "test \" more text".
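A sketch of that kind of scan for a string literal with an escaped quote inside (hand-rolled rather than lex-generated; the helper name is made up, and the "escaped" flag plays the role of the extra lexer state):

```python
def scan_string(src, i):
    """src[i] must be the opening quote; returns (literal, next_index)."""
    assert src[i] == '"'
    j, escaped = i + 1, False
    while j < len(src):
        c = src[j]
        if escaped:
            escaped = False      # the escaped character is consumed as-is
        elif c == "\\":
            escaped = True       # next character cannot close the string
        elif c == '"':
            return src[i:j + 1], j + 1
        j += 1
    raise SyntaxError("unterminated string literal")
```

Because the flag suppresses the normal meaning of the quote character, an embedded \" does not end the literal, exactly like an exclusive lex state suppresses normal token rules.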

+1
