Context-sensitive code tokenization

I am working on a parser for a language with:

  • identifiers (e.g. a letter followed by any number of alphanumeric characters or underscores),

  • integers (any number of digits and possibly carets ^),

  • some operators,

  • filenames (any number of alphanumeric characters and possibly slashes and periods).

Obviously, the filename pattern overlaps with integers and identifiers, so if the text contains no slash or period I cannot tell whether I have a filename or, say, an identifier.

But the file name can only follow a specific operator.

My question is: how is this situation usually handled during tokenization? I have a table-driven tokenizer (lexer), but I'm not sure how to tell a filename apart from an integer or an identifier. How is this done?

If filenames were a superset of integers and identifiers, I could probably write grammar rules to handle this, but the token classes overlap ...
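To make the overlap concrete, here is a small sketch in Python. The three regular expressions are only my reading of the informal descriptions above (they are assumptions, not the real grammar), and `matching_classes` is a hypothetical helper that lists every token class whose pattern matches a whole lexeme.

```python
import re

# Assumed patterns, approximating the informal descriptions in the question.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")   # letter, then alphanumerics/underscores
INTEGER = re.compile(r"[0-9^]+")                     # digits and possibly carets
FILENAME = re.compile(r"[A-Za-z0-9/.]+")             # alphanumerics, slashes, periods


def matching_classes(lexeme):
    """Return the names of all token classes whose pattern matches the whole lexeme."""
    patterns = {"IDENTIFIER": IDENTIFIER, "INTEGER": INTEGER, "FILENAME": FILENAME}
    return [name for name, pattern in patterns.items() if pattern.fullmatch(lexeme)]


if __name__ == "__main__":
    for lexeme in ("abc1", "123", "src/main.c", "a_b", "2^3"):
        print(lexeme, matching_classes(lexeme))
```

Under these assumed patterns, `abc1` and `123` are ambiguous, while `a_b` and `2^3` are not filenames at all, so neither class contains the other.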

1 answer

Flex and other lexer generators have the concept of start conditions. In essence, the lexer is a state machine, and its exact behavior depends on its current state.

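In your case, when the lexer sees the operator that precedes a filename, it would switch to a FilenameMode state (a start condition), read the filename there, and then switch back to the default state.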

EDIT:

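For example, in flex syntax:

Declare a FILENAME_MODE start condition (for example with %x FILENAME_MODE in the definitions section), then switch to it when the operator that introduces a filename is matched ...

{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }

and add a rule that is active only in that state:

<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }

... which switches the lexer back to the INITIAL state once the filename has been read.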
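The same idea can be sketched outside flex. The Python below is only an illustration of a lexer that changes its rule set depending on its current state. The `@` operator, the token names, and the patterns are all assumptions made for the example, and unlike flex it takes the first matching rule rather than the longest match.

```python
import re

# Hypothetical operator and patterns; none of these come from the question.
FILENAME_PREFIX = "@"  # assumed operator that introduces a filename

INITIAL_RULES = [
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("INTEGER", re.compile(r"[0-9^]+")),
    ("FILENAME_PREFIX", re.compile(re.escape(FILENAME_PREFIX))),
    ("OPERATOR", re.compile(r"[+\-*/=]")),
]
FILENAME_RULES = [("FILENAME", re.compile(r"[A-Za-z0-9/.]+"))]
WHITESPACE = re.compile(r"\s+")


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, switching rule sets by state."""
    state = "INITIAL"
    pos = 0
    tokens = []
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:  # skip whitespace in every state
            pos = ws.end()
            continue
        # Only the rules of the current state are tried.
        rules = FILENAME_RULES if state == "FILENAME_MODE" else INITIAL_RULES
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                # Enter FILENAME_MODE after the prefix; otherwise go back to INITIAL.
                state = "FILENAME_MODE" if kind == "FILENAME_PREFIX" else "INITIAL"
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
    return tokens


if __name__ == "__main__":
    print(tokenize("x = 42 @ src/main.c + y"))
    print(tokenize("@ abc"))
```

With these assumed rules, `abc` on its own is tokenized as an identifier, but the same text after `@` comes out as a filename, because the filename rule is the only one active in that state.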
