C # Dragon Book (Lexical Analysis) How to process literals

This project is for educational use, and I know very well that great compilers already exist.

Currently, I am making my way through the famous book of dragons and have just started selling my own Lexer. It works surprisingly well, except for literals. I don’t understand how to process literals using lookup tables, and the book doesn’t seem to describe this very well:

The following code 60 has a numeric literal:

 int myIdentifier = 60; 

The book of the Dragon reads:

From a technical point of view, for lexeme 60 we need to create a token similar to (number, 4), where 4 indicates a character table for the internal representation of the integer 60 [...]

Got it - I created the following token:

 <enum TokenType, int lookupIndex> //TokenType could be 'number' and lookupIndex could be any int 

And he saved the literal in the dictionary as follows:

 Dictionary<int literal, int index> //literal could be '60' and index could be anything 

Since the literal itself is the key in the dictionary, this allows me to quickly check if future literals are added to the symbol table (or not).

Parser then receives the tokens from Lexer and should be able to identify literals in the symbol table.

Questions:

  • Why should my token contain a search index instead of containing the literal itself?
    It would not be faster ...
  • How should Parser quickly find literal values ​​inside a character table when the search index is a dictionary value?
    (I cannot make index-index a dictionary key, because Lexer will then have to check the value of the dictionary, which is also not very efficient)
    Could a multi-index dictionary be a solution? Probably not ...
  • Should I create a character table for each type of literal?
    Fe: Dictionary<int literal, int index>
    and Dictionary<double literal, int index>
    and Dictionary<char literal, int index> etc.
  • Perhaps I am on the wrong path with literals. Feel free to post any best solutions.
+5
source share
1 answer
  • Why should my token contain a search index instead of containing the literal itself? Wouldn't it be faster?

    Of course, most likely, it will be faster. But then each literal would be a different value. Now most programmers have the expectation that if they use, for example, "this longish string" twice in the same program, the compiler will be smart enough to release only one copy of this line in the final executable. And it would also be, say, surprising if you decompiled the code, you found 273 different storage locations for constant 1 , because every time the compiler saw a += 1 , it created a new constant.

    The easiest way to ensure that constant literals are emitted only once is to store them in an associative container indexed by the value of the literal.

    As @ sepp2k points out in the communique, most hardware allows you to use small integer constants as direct operands, and sometimes not even very small constants. Thus, the statement about constant 1 above is a little exaggeration. You may be able to process integers in different ways, but this may not be worth the trouble.

  • How should Parser quickly find literal values ​​inside a character table when the search index is a dictionary value?

    It depends a lot on the exact data structure you use for literal tables (which I don't like to call symbol tables, but admittedly these concepts are related). In many languages, you will find that your standard library libraries are not the perfect match for the problem, so you will either have to adapt them to the target or replace the records.

    However, it is not very difficult. One possibility is to use a combination of a map<literalType, int> and a vector<literalType> . Here, mapping maps literal values ​​to indices in a vector. When you find the new value of the literal, you enter it into the map associated with the current size of the vector, and then click on that value on the vector (which will make its index correspond to the index that you just inserted into the map).

    This is not quite ideal for large constants, such as strings, because between a key on the map and a value in a vector, the constant is stored twice. When you start, I would recommend just suppressing your annoyance about this duplication; later, if this turns out to be a problem, you can find a solution.

    If you used C ++, you can use a (unordered) set instead of a map and use a link (pointer) to a newly added element instead of an index. But I do not think that this function is available in many languages, and pointers are sometimes inconvenient compared to indexes. In some languages, you can put all the values ​​in a vector, and then save the set whose keys were an index in the vector. This requires that dialing can be done using something other than the key type; for some reason, this feature is available in very few data libraries.

    And, yes, you can use a bidirectional data structure if you have one of these convenient ones. (In essence, the map + vector vector solution is a doubly indexed data structure.)

  • Should I create a character table for each type of literal?

    May be. How many kinds of literals do you have?

    You will probably end up using indexed enumerations (“discriminatory joins”) for both variables and constants. (Again, not all languages ​​discriminated against unions in their standard library, which is really sad if your implementation language lacks this basic function, you will need to implement it.) Of course, it is possible that a discriminated instance of a union will be used as a key in an associative data structure, so there is nothing that would prevent you from basically storing all your literals in a single data structure. If you have the appropriate types, this is definitely what I would recommend, at least at startup.

    Note that when you ultimately emit literals as object code, you are more interested in their representation and alignment of bits than their semantics. If two constants of completely different types have the same bit representation, then you can use the same storage location for both of them. If you have multiple widths of integer data types, then you probably want to save them all in one literal table to take advantage of this optimization. No need to store 1 each width :). Sometimes you will find other cases where two literals of different types have the same representation, but this is probably not common enough to get out of your way to handle this. (However, on IEEE equipment, floating point and integer zeros have the same representation, and this is usually the same representation as the NULL pointer, so you might want to use special case registers.)

    Overall, this is a call to court. How difficult is it to use discriminatory union as a key? How much storage could you save by having associative containers with certain types of keys, and does it matter? Do you want to iterate over all literal constants of the same type (answer: possible) and how easily is it related to your data structures?

    If you use a well-developed internal API, you can change your mind about the internal representation of your literal tables. Therefore, I would use this experiment as an opportunity to try a good API design.

  • Anything else?

    Good luck with your project. Learn and enjoy!

+2
source

All Articles