Lexer Output

I am currently writing a compiler and I am in the Lexer phase.

I know that the lexer tokenizes the input stream.

However, consider the following statement:

int foo = 0; 

Should the lexer output be as follows: keyword letter letter letter equals digit semicolon? And does the parser then reduce letter letter letter to an identifier?

+2
compiler-construction lexer
3 answers

In general, your lexer should produce a stream of structures that represent language elements: operators, identifiers, keywords, comments, etc. Each structure must be labeled with its token type and carry content appropriate to the kind of lexeme it represents.

For good error reporting, it helps if each token carries its start line and column, its end line and column (some tokens span several lines), and the source file it came from (sometimes the parser must handle included files as well as the main file).

For language elements whose content varies (numbers, identifiers, etc.), the structure must also contain that content.
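As a rough illustration of such a structure, here is a minimal Python sketch. The `Token` class, its field names, the token kind names, and the `location` helper are all made up for this example; the convention that columns are 1-based with an exclusive end column is also just an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    kind: str                          # token type, e.g. "IDENTIFIER"
    value: Optional[Union[str, int]]   # payload for tokens whose content varies, else None
    file: str                          # source file the token came from
    line: int                          # start line (1-based)
    col: int                           # start column (1-based)
    end_line: int                      # end line
    end_col: int                       # end column (exclusive, by assumption)


def location(tok: Token) -> str:
    """Format a token's start position for use in an error message."""
    return f"{tok.file}:{tok.line}:{tok.col}"


# Hand-built tokens for `int foo = 0;` on line 1 of a hypothetical file "test.c".
TOKENS = [
    Token("KW_INT", None, "test.c", 1, 1, 1, 4),
    Token("IDENTIFIER", "foo", "test.c", 1, 5, 1, 8),
    Token("EQUALS", None, "test.c", 1, 9, 1, 10),
    Token("INT_LITERAL", 0, "test.c", 1, 11, 1, 12),
    Token("SEMICOLON", None, "test.c", 1, 12, 1, 13),
]
```

Only the identifier and the literal carry a value here; the keyword and the punctuation are fully described by their kind.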

To compile or analyze programs, the lexer can discard whitespace and comments. If you are going to parse and then modify the code, you will need to capture the comments.
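One way to sketch that choice is a small regex-driven lexer with a switch for keeping or dropping whitespace and comments. The token names, the tiny keyword list, and the `keep_trivia` flag below are assumptions made for this example, and the patterns cover only enough of C to handle the sample line.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Order matters: earlier alternatives win, so KEYWORD is tried before IDENTIFIER.
TOKEN_SPEC = [
    ("COMMENT",     r"/\*.*?\*/|//[^\n]*"),
    ("WHITESPACE",  r"\s+"),
    ("INT_LITERAL", r"\d+"),
    ("KEYWORD",     r"(?:int|return)\b"),   # illustrative subset of keywords
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("EQUALS",      r"="),
    ("SEMICOLON",   r";"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)
TRIVIA = {"COMMENT", "WHITESPACE"}


def lex(source: str, keep_trivia: bool = False) -> Iterator[Token]:
    """Yield tokens; whitespace and comments are dropped unless keep_trivia is set."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = m.end()
        if keep_trivia or m.lastgroup not in TRIVIA:
            yield Token(m.lastgroup, m.group())
```

A compiler front end would call `lex(source)`; a tool that rewrites source text would call `lex(source, keep_trivia=True)` so that nothing from the input is lost.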

An example of the output can be instructive. For the OP's example:

    /* My test file */

    int foo
        = 0; // a declaration

... the DMS C front end produces the following tokens (this is debug output, which is very convenient when developing a complex lexer):

    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>run ../domainlexer C:\temp\test.c
    Lexer Stream Display 1.5.1
    Using encoding Unicode-UTF-8?ANSI +CRLF +1 /^I
    !! Lexer:ResetLexicalModeStack
    !! after Lexer:PushLexicalMode:
    Lexical Mode Stack:
    1 C
    File "C:/temp/test.c", line 1: /* My test file */
    File "C:/temp/test.c", line 2:
    File "C:/temp/test.c", line 3: int foo
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 3 Col 1 ELine 3 ECol 4 Token 23: 'int' [VOID]=0000
      <<< PreComments:
    Comment 1 Type 1 Line 1 Column 1 `/* My test file */'
    !! Lexeme @ Line 3 Col 4 ELine 3 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 3 Col 5 ELine 3 ECol 8 Token 210: IDENTIFIER [STRING]=`foo'
    File "C:/temp/test.c", line 4: = 0; // a declaration
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 3 Col 8 ELine 4 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 4 Col 5 ELine 4 ECol 6 Token 113: '=' [VOID]=0000
    !! Lexeme @ Line 4 Col 6 ELine 4 ECol 7 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 4 Col 7 ELine 4 ECol 8 Token 138: INT_LITERAL [NATURAL]=0
    File "C:/temp/test.c", line 5:
    !! Lexeme @ Line 4 Col 8 ELine 4 ECol 9 Token 98: ';' [VOID]=0000
      >>> PostComments:
    Comment 1 Type 2 Line 4 Column 10 `// a declaration'
    File "C:/temp/test.c", line 5:
    File "C:/temp/test.c", line 6:
    File "C:/temp/test.c", line 7:
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 4 Col 26 ELine 7 ECol 1 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 4: end_of_input_stream [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 0: EndOfFile
    11 lexemes processed.
    0 lexical errors detected.

    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>

The key output is the lines marked with !!, each of which represents the contents of a lexeme structure produced by the lexer. Each token carries:

  • source position information (the file name is not printed for the main file, "test.c" in this case, to keep the debug output readable)
  • a "token number" (the lexeme type) and a human-readable token name (which simplifies debugging)
  • the type of value carried by the token: [VOID] means "none", [STRING] means that the token carries a string value, [NATURAL] means that it carries an integer value, etc.
  • precomments: comments preceding the token. This is unusual for classic lexers, but necessary if you are trying to transform source code. You cannot lose comments! Note that a precomment is attached to the token that follows it; because comments are not semantically significant, one can argue about where they should be attached. This is our particular choice.
  • postcomments: comments that follow the token and belong to it.
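To make the pre/post idea concrete, here is a small Python sketch of one possible attachment rule. It is not how DMS does it; the rule is an assumption chosen for the example: a comment on the same line as the preceding token becomes that token's postcomment, and any other comment becomes a precomment of the next token. The `Token` class and `attach_comments` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    kind: str
    text: str
    line: int
    precomments: List[str] = field(default_factory=list)
    postcomments: List[str] = field(default_factory=list)


def attach_comments(raw: List[Tuple[str, str, int]]) -> List[Token]:
    """raw holds (kind, text, line) entries, whitespace already removed."""
    tokens: List[Token] = []
    pending: List[str] = []          # comments waiting for the next real token
    for kind, text, line in raw:
        if kind == "COMMENT":
            if tokens and tokens[-1].line == line:
                tokens[-1].postcomments.append(text)   # trailing comment
            else:
                pending.append(text)                   # leading comment
        else:
            tokens.append(Token(kind, text, line, precomments=pending))
            pending = []
    if pending and tokens:
        # Comments at end of file have no following token; keep them on the last one.
        tokens[-1].postcomments.extend(pending)
    return tokens


RAW = [
    ("COMMENT", "/* My test file */", 1),
    ("KEYWORD", "int", 3),
    ("IDENTIFIER", "foo", 3),
    ("EQUALS", "=", 4),
    ("INT_LITERAL", "0", 4),
    ("SEMICOLON", ";", 4),
    ("COMMENT", "// a declaration", 4),
]
TOKENS = attach_comments(RAW)
```

With this input the block comment ends up as a precomment on `int` and the line comment as a postcomment on `;`, which happens to agree with the trace above. An input consisting only of comments would lose them in this sketch; a real tool would need somewhere to hang those too.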

The last "token", EndOfFile, is implicitly defined in every DMS lexer.

This debug trace also marks the lexer's transitions between lexical modes (many lexer generators have several modes in which they lex different parts of the language). It also shows the source lines as they are read in.
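For readers who have not met lexical modes before, the following Python sketch shows the general mechanism with two invented modes, "CODE" and "COMMENT". It says nothing about what the DMS modes C and CMain actually do; the mode names, token names, and table layout are assumptions for illustration.

```python
import re
from typing import List, Tuple

# Each mode has its own rule list: (token kind, pattern, mode to switch to or None).
MODES = {
    "CODE": [
        ("COMMENT_START", re.compile(r"/\*"), "COMMENT"),
        ("WORD",          re.compile(r"[A-Za-z_]\w*"), None),
        ("INT_LITERAL",   re.compile(r"\d+"), None),
        ("SPACE",         re.compile(r"\s+"), None),
        ("PUNCT",         re.compile(r"[=;]"), None),
    ],
    "COMMENT": [
        ("COMMENT_END",   re.compile(r"\*/"), "CODE"),
        # Anything up to, but not including, the closing "*/".
        ("COMMENT_TEXT",  re.compile(r"(?:[^*]|\*(?!/))+"), None),
    ],
}


def lex_with_modes(source: str, mode: str = "CODE") -> List[Tuple[str, str, str]]:
    """Return (mode, kind, text) triples, switching rule sets as rules request."""
    pos = 0
    out = []
    while pos < len(source):
        for kind, pattern, next_mode in MODES[mode]:
            m = pattern.match(source, pos)
            if m:
                out.append((mode, kind, m.group()))
                pos = m.end()
                if next_mode:
                    mode = next_mode
                break
        else:
            raise SyntaxError(f"no rule in mode {mode} matches at offset {pos}")
    return out
```

The point is that the same characters can be tokenized differently depending on the current mode: inside "COMMENT" the word and punctuation rules simply do not exist.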

+4

There is no real benefit to having “letter” as an intermediate step; “foo” should simply be an identifier. Otherwise, you would have to treat “int” as letters too, which does not make much sense.
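In code, that usually means the lexer consumes the longest run of identifier characters in one go and then checks the result against a keyword table. A minimal hand-written sketch in Python follows; the `scan_word` function and the keyword subset are hypothetical.

```python
from typing import Tuple

KEYWORDS = {"int", "if", "else", "return"}   # illustrative subset only


def scan_word(source: str, start: int) -> Tuple[str, str, int]:
    """Scan the longest identifier-like run beginning at `start`.

    Returns (kind, text, index just past the word).
    """
    first = source[start]
    if not (first.isalpha() or first == "_"):
        raise ValueError(f"no word starts at offset {start}")
    end = start
    while end < len(source) and (source[end].isalnum() or source[end] == "_"):
        end += 1
    text = source[start:end]
    # The whole word is classified at once; no per-letter tokens are ever emitted.
    kind = "KEYWORD" if text in KEYWORDS else "IDENTIFIER"
    return kind, text, end
```

Because the keyword check happens after the whole word is read, a name like `integer` is an identifier rather than the keyword `int` followed by stray letters.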

+2

In general, there is no simple answer.

Typically, a lexer identifies higher-level elements such as an identifier, or even a type or a variable, if the grammar of the language allows it. The more dynamic the grammar is, and the more the interpretation of a token depends on the parser's internal state, the easier it may be to leave that interpretation to the parser. Otherwise, the connection between lexer and parser may become too complex. (For example, consider a language in which int is a type in one place, a valid variable name in another, and a language keyword in a third.)

As a rule of thumb: let the lexer do all the work that simplifies the grammar, as long as it does not add complexity to the interface between lexer and parser.
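A sketch of the context-dependent case, assuming a made-up language in which `int` may be both a type and a variable name: the lexer emits a neutral WORD token and the parser decides what each word means from its position. The token kinds, `TYPE_NAMES`, and `parse_declaration` are invented for this example.

```python
from typing import Dict, List, Tuple, Union

TYPE_NAMES = {"int", "char"}   # words that are types when they appear in type position


def parse_declaration(tokens: List[Tuple[str, str]]) -> Dict[str, Union[str, int]]:
    """Parse `<type> <name> = <integer> ;` from (kind, text) pairs.

    Both the type and the name arrive as plain WORD tokens; only their
    position in the declaration tells the parser which is which.
    """
    kinds = [kind for kind, _ in tokens]
    if kinds != ["WORD", "WORD", "EQUALS", "INT_LITERAL", "SEMICOLON"]:
        raise SyntaxError("expected: <type> <name> = <integer> ;")
    type_text, name_text, value_text = tokens[0][1], tokens[1][1], tokens[3][1]
    if type_text not in TYPE_NAMES:
        raise SyntaxError(f"unknown type {type_text!r}")
    return {"type": type_text, "name": name_text, "value": int(value_text)}


# `int int = 0;` is legal in this hypothetical language.
DECL = parse_declaration([
    ("WORD", "int"), ("WORD", "int"), ("EQUALS", "="),
    ("INT_LITERAL", "0"), ("SEMICOLON", ";"),
])
```

Here the lexer stays simple and context-free, at the cost of a little extra work in the parser. For a language like the one in the question, where `int` is always a keyword, classifying it in the lexer is the simpler design.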

+2