In general, your lexer should produce a stream of structures that represent language elements: operators, identifiers, keywords, comments, etc. Each structure should be tagged with its token type and carry content appropriate to the type of lexeme it represents.
For good error reporting, each token should carry its start line and column, its end line and column (some tokens span several lines), and the source file it came from (sometimes the parser has to process included files as well as the main file).
For language elements with variable content (numbers, identifiers, etc.), the structure must also hold that content.
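As a rough illustration of such a structure, here is a minimal sketch in Python. The names `TokenKind` and `Token` and the field names are assumptions made for this example, not part of any particular lexer generator.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union


class TokenKind(Enum):
    """Hypothetical set of token types; a real language has many more."""
    KEYWORD = auto()
    IDENTIFIER = auto()
    NUMBER = auto()
    OPERATOR = auto()
    COMMENT = auto()
    WHITESPACE = auto()


@dataclass(frozen=True)
class Token:
    kind: TokenKind        # the type tag
    source_file: str       # originating file (main or included)
    line: int              # start position
    col: int
    end_line: int          # end position; may be on a later line
    end_col: int
    # Variable content; None for tokens such as keywords that carry none.
    value: Optional[Union[str, int]] = None


# The first tokens of "int foo = 0;" placed on line 3 of test.c
tok_int = Token(TokenKind.KEYWORD, "test.c", 3, 1, 3, 4)
tok_foo = Token(TokenKind.IDENTIFIER, "test.c", 3, 5, 3, 8, value="foo")
```

Only the identifier carries a value here; the keyword is fully described by its type tag and position.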
If you only compile or analyze programs, the lexer can discard whitespace and comments. If you are going to transform or modify the code, you will need to capture the comments.
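The choice between discarding and keeping whitespace and comments can be a single switch in the lexer. The following is a minimal regex-based sketch under simplifying assumptions: the token classes, the `lex` function and its `keep_trivia` flag are invented for this example, and keywords are lumped in with identifiers for brevity.

```python
import re

# One named group per token class. Comments come before operators so that
# "/*" and "//" are not lexed as a "/" operator.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<whitespace>\s+)
  | (?P<number>\d+)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<operator>[=;+\-*/(){}])
""", re.VERBOSE | re.DOTALL)


def lex(text, keep_trivia=False):
    """Yield (kind, lexeme) pairs; skip whitespace and comments unless
    keep_trivia is true."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind in ("comment", "whitespace") and not keep_trivia:
            continue
        yield kind, m.group()


for_compiler = list(lex("int foo = 0; // init"))
for_rewriter = list(lex("int foo = 0; // init", keep_trivia=True))
```

With the default setting the comment never reaches the parser; with `keep_trivia=True` it is still in the stream and can be preserved by a tool that rewrites the code.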
An example of the output may be instructive. For the OP's example:
int foo = 0;
... the DMS C front end produces the following lexemes (this is debug output, which is very convenient when developing a complex lexer):
    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>run ../domainlexer C:\temp\test.c
    Lexer Stream Display 1.5.1
    Using encoding Unicode-UTF-8?ANSI +CRLF +1 /^I
    !! Lexer:ResetLexicalModeStack
    !! after Lexer:PushLexicalMode:
    Lexical Mode Stack:
    1 C
    File "C:/temp/test.c", line 1:
    File "C:/temp/test.c", line 2:
    File "C:/temp/test.c", line 3: int foo
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 3 Col 1 ELine 3 ECol 4 Token 23: 'int' [VOID]=0000
      <<< PreComments:
    Comment 1 Type 1 Line 1 Column 1 `'
    !! Lexeme @ Line 3 Col 4 ELine 3 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 3 Col 5 ELine 3 ECol 8 Token 210: IDENTIFIER [STRING]=`foo'
    File "C:/temp/test.c", line 4: = 0;
The key output is the lines marked !! Lexeme, each of which shows the content of one lexeme structure produced by the lexer. Each token carries:
- source position information (the file, here the main file "test.c", is not printed on each lexeme line, to keep the debug output readable)
- a "token number" (the lexeme type) and a human-readable token name (which simplifies debugging)
- the type of value carried by the token: [VOID] means "none", [STRING] means that the token carries a string value, [NATURAL] means that it carries an integer value, etc.
- precomments: comments that precede the token. This is unusual for classic lexers, but necessary if you are trying to transform source code. You cannot lose the comments! Note that a precomment is attached to a token; because comments are not semantically significant, one can argue about where they should be placed. This is our particular choice.
- postcomments: comments that follow the token and belong to it.
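To make the precomment/postcomment idea concrete, here is a small sketch of one possible attachment policy. It is not the DMS rule, which is not spelled out here. The assumed policy: a comment on the same line as the preceding token becomes that token's postcomment, and any other comment becomes a precomment of the next token. The `Lexeme` class, the `attach_comments` function and the synthetic `EndOfFile` lexeme that collects leftover comments are all inventions for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Lexeme:
    kind: str
    text: str
    line: int
    pre_comments: List[str] = field(default_factory=list)
    post_comments: List[str] = field(default_factory=list)


def attach_comments(raw: List[Tuple[str, str, int]]) -> List[Lexeme]:
    """raw holds (kind, text, line) triples with whitespace already removed.
    Comments are folded into neighbouring lexemes instead of being dropped."""
    result: List[Lexeme] = []
    pending: List[str] = []          # comments waiting for the next token
    for kind, text, line in raw:
        if kind != "comment":
            result.append(Lexeme(kind, text, line, pre_comments=pending))
            pending = []
        elif result and result[-1].line == line and not pending:
            # Assumed rule: same line as the previous token -> postcomment.
            result[-1].post_comments.append(text)
        else:
            pending.append(text)
    # A final EndOfFile lexeme picks up comments that follow the last token.
    last_line = raw[-1][2] if raw else 1
    result.append(Lexeme("EndOfFile", "", last_line, pre_comments=pending))
    return result


lexemes = attach_comments([
    ("comment", "/* header */", 1),
    ("keyword", "int", 3),
    ("identifier", "foo", 3),
    ("operator", "=", 3),
    ("number", "0", 3),
    ("operator", ";", 3),
    ("comment", "// init", 3),
    ("comment", "// trailing", 5),
])
```

Whatever policy is chosen, the property that matters is that every comment ends up attached to some lexeme, so a tool that regenerates source text can put it back.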
The final "token", EndOfFile, is implicitly defined in every DMS lexer.
This debug trace also marks the lexer's transitions between lexical modes (many lexer generators have several modes in which they lex different parts of the language), and it shows the source lines as they are read.
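Lexical modes can be sketched as a stack of rule sets, where matching a rule may push or pop a mode. The example below is a toy with two hypothetical modes, "main" and "string"; it is not how DMS implements the C and CMain modes seen in the trace, and the `MODES` table and `lex_with_modes` function are names made up for this illustration.

```python
import re

# Each mode has its own rules: (pattern, token kind or None, action or None).
# An action is ("push", mode_name) or ("pop",).
MODES = {
    "main": [
        (re.compile(r"\s+"), None, None),                  # skip whitespace
        (re.compile(r"[A-Za-z_]\w*"), "IDENTIFIER", None),
        (re.compile(r"="), "EQUALS", None),
        (re.compile(r";"), "SEMICOLON", None),
        (re.compile(r'"'), None, ("push", "string")),      # enter string mode
    ],
    "string": [
        (re.compile(r'[^"\\]+'), "STRING_CHARS", None),    # spaces are content here
        (re.compile(r"\\."), "STRING_ESCAPE", None),
        (re.compile(r'"'), None, ("pop",)),                # back to previous mode
    ],
}


def lex_with_modes(text):
    """Return (mode, kind, lexeme) triples, switching rule sets via a mode stack."""
    stack = ["main"]
    pos = 0
    tokens = []
    while pos < len(text):
        for pattern, kind, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"no rule in mode {stack[-1]!r} at offset {pos}")
        pos = m.end()
        if kind is not None:
            tokens.append((stack[-1], kind, m.group()))
        if action is not None:
            if action[0] == "push":
                stack.append(action[1])
            else:
                stack.pop()
    return tokens


mode_tokens = lex_with_modes('s = "a b";')
```

The same space character is skipped in "main" mode but kept as part of the token in "string" mode, which is the kind of context-dependent behaviour that modes exist to handle.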