You can analyze indentation-sensitive languages (like Python or Haskell) using a small hack, which is well described in the Python help chapter in lexical analysis . As described, the lexical analyzer turns leading spaces into INDENT and DEDENT [Note 1], which are then used in a Python grammar in a simple way. Here is a quick excerpt:
suite ::= stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT statement ::= stmt_list NEWLINE | compound_stmt stmt_list ::= simple_stmt (";" simple_stmt)* [";"] while_stmt ::= "while" expression ":" suite ["else" ":" suite]
So, if you are ready to describe (or link) the lexical analysis algorithm, BNF is simple.
However, you cannot write this algorithm as context-free grammar, because it is not contextual. (I will leave this proof, but it is similar to the proof that a n b n c n not the context that you can find in most simple textbooks on the formal language and across the Internet.)
The ISO EBNF standard (free PDF is available) provides a way to include “extensions that a user may require”: a Special-sequence , which is any text that does not contain a ? surrounded on both sides by a symbol ? . Thus, you can abuse the notation by including [Note 2]:
DEDENT = ? See section 2.1.8 of https://docs.python.org/3.3/reference/ ? ;
Or you can insert a full description of the algorithm. Of course, none of these methods would allow the parser to produce an accurate lexical analyzer, but it would be a reasonable way to convey intentions to the human reader.
It is worth noting that EBNF itself uses a special sequence to determine one of its production facilities:
(* see 4.7 *) syntactic exception = ? a syntactic-factor that could be replaced by a syntactic-factor containing no meta-identifiers ? ;
Notes
The lexical analyzer also converts some physical newline characters to NEWLINE tokens, and other newline characters disappear.
EBNF usually uses the syntax = instead of ::= for production and insists that they be completed with ; . Comments are between (* and *) .
rici
source share