Context-sensitive code tokenization

Question


I am working on a parser for a language that has:

identifiers (e.g. a letter followed by any number of alphanumeric characters or underscores),
integers (any number of digits and possibly a caret ^),
some operators,
filenames (any number of alphanumeric characters and possibly slashes and dots).

Obviously, filenames overlap with integers and identifiers, so in general I cannot decide whether I have a filename or, say, an identifier unless the filename contains a slash or a dot.

But the file name can only follow a specific operator.

My question is: how is this situation usually handled during tokenization? I have a table-driven tokenizer (lexer), but I'm not sure how to tell a filename apart from an integer or an identifier. How is this done?

If filename were a superset of integers and identifiers, then I could probably write grammar productions to handle this, but the tokens only overlap ...
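The overlap can be made concrete with a short Python sketch. The regular expressions are an assumed reading of the informal token descriptions above (the optional caret on integers is left out because its position is not specified), and `kinds_matching` is a hypothetical helper, not part of any lexer generator.

```python
import re

# Assumed patterns, read off the informal descriptions in the question.
# The optional caret on integers is omitted because its position is unspecified.
PATTERNS = {
    "IDENTIFIER": re.compile(r"[A-Za-z][A-Za-z0-9_]*"),
    "INTEGER": re.compile(r"[0-9]+"),
    "FILENAME": re.compile(r"[A-Za-z0-9/.]+"),
}


def kinds_matching(lexeme):
    """Return the sorted names of every pattern that matches the whole lexeme."""
    return sorted(name for name, pat in PATTERNS.items() if pat.fullmatch(lexeme))


for lexeme in ["foo", "123", "src/main.c", "my_var"]:
    print(lexeme, kinds_matching(lexeme))
```

Under these assumptions, `foo` and `123` are ambiguous, `src/main.c` can only be a filename, and `my_var` can only be an identifier, so neither token class contains the other.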

+4

tokenize parsing token formal-languages

akonsu Aug 21 '15 at 15:47


1 answer

geoff_h · Accepted Answer · 2015-08-21T16:22:16+0000

Flex and other lexers have the concept of start conditions. In essence, a lexer is a state machine, and its exact behavior depends on its current state.

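In your case, when the lexer encounters the operator that precedes a filename, it should switch to a FilenameMode state (or whatever you choose to call it), and then switch back once it has produced the filename token it was expecting.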

EDIT:

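To give some concrete code:

Enter FILENAME_MODE (declared as an exclusive start condition with %x FILENAME_MODE) when you encounter the operator that introduces a filename: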

{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }

Then define the rule that matches a filename:

<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }

... which switches back to the INITIAL state in its action.
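For readers without flex at hand, the same mode switch can be sketched as a small hand-written tokenizer in Python. This is only an illustration under stated assumptions: `@` is a stand-in for the operator that introduces a filename, the token patterns are guessed from the question, and `tokenize` is a hypothetical function. Unlike flex, it takes the first rule that matches rather than the longest match.

```python
import re

# Hypothetical stand-in for the operator that must precede a filename.
FILENAME_PREFIX = "@"

# Assumed token patterns; rules are tried in order (first match wins,
# which is simpler than flex's longest-match behaviour).
INITIAL_RULES = [
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("INTEGER", re.compile(r"[0-9]+")),
    ("OPERATOR", re.compile(r"[@+\-*/=]")),
]
FILENAME_RULES = [("FILENAME", re.compile(r"[A-Za-z0-9/.]+"))]
WHITESPACE = re.compile(r"\s+")


def tokenize(text):
    """Return (kind, lexeme) pairs, switching modes like a start condition."""
    tokens = []
    mode = "INITIAL"
    pos = 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        # Only the rules of the current mode are active (an exclusive condition).
        rules = FILENAME_RULES if mode == "FILENAME_MODE" else INITIAL_RULES
        for kind, pattern in rules:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"unexpected {text[pos]!r} at offset {pos} in {mode}")
        tokens.append((kind, m.group()))
        pos = m.end()
        if mode == "FILENAME_MODE":
            mode = "INITIAL"            # like BEGIN(INITIAL) after the filename
        elif kind == "OPERATOR" and m.group() == FILENAME_PREFIX:
            mode = "FILENAME_MODE"      # like BEGIN(FILENAME_MODE)
    return tokens


print(tokenize("x = 42"))
print(tokenize("@ foo + foo"))
```

The second call shows the point of the technique: the first `foo` follows the prefix operator and is tokenized as a filename, while the second `foo` is seen in the initial mode and is tokenized as an identifier.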
