What grammars can be parsed using recursive descent without backtracking?

According to Wikipedia's “Recursive descent parser” article, recursive descent without backtracking (also known as predictive parsing) is only possible for LL(k) grammars.

Elsewhere, I read that the Lua implementation uses such a parser. However, the language is not LL(k). In fact, Lua appears to be ambiguous: does a = f(g)(h)[i] = 1 mean a = f(g); (h)[i] = 1 or a = f; (g)(h)[i] = 1? This ambiguity is resolved by greediness in the parser (so the above is parsed as a = f(g)(h)[i], followed by the erroneous = 1).

This example shows that predictive parsers can handle grammars that are not LL(k). Is it true that they can handle a superset of the LL(k) grammars? If so, is there a way to find out whether a given grammar lies in this superset?

In other words, if I am designing a language that I would like to parse with a predictive parser, do I need to restrict it to an LL(k) language? Or is there a looser restriction I can apply instead?

lua parsing context-free-grammar recursive-descent ll
1 answer

TL;DR

For a suitable definition of what a recursive descent parser is, it is absolutely true that only LL(k) languages can be parsed by recursive descent.

Lua can be parsed with a recursive descent parser precisely because the language is LL(k); that is, an LL(k) grammar exists for Lua. [Note 1]

1. An LL(k) language may have non-LL(k) grammars.

A language is LL(k) if there is an LL(k) grammar that recognizes it. That does not mean that every grammar recognizing the language is LL(k); there can be any number of non-LL(k) grammars that recognize it. So the fact that some particular grammar is not LL(k) says absolutely nothing about the language itself.

2. Many practical programming languages are described by an ambiguous grammar.

In formal language theory, a language is inherently ambiguous only if every grammar for the language is ambiguous. It is probably safe to say that no practical programming language is inherently ambiguous, since practical programming languages can be parsed deterministically (one way or another). [Note 2]

Since writing a strictly unambiguous grammar can be tedious, it is quite common for a language's documentation to provide an ambiguous grammar, together with textual material indicating how the ambiguities are to be resolved.

For example, many languages (including Lua) are documented with a grammar that does not explicitly encode operator precedence, allowing a simple rule for expressions:

 exp ::= exp Binop exp | Unop exp | term 

This rule is clearly ambiguous, but given a list of operators, their relative precedences, and whether each operator is left- or right-associative, the rule can be mechanically expanded into an unambiguous expression grammar. Indeed, many parser generators allow the user to provide the precedence declarations separately and perform the mechanical expansion in the course of generating the parser. It should be noted that the resulting parser is a parser for the unambiguous expanded grammar, so the ambiguity of the original grammar does not imply that the parsing algorithm is capable of handling ambiguous grammars.
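To illustrate what the mechanical expansion buys, here is a minimal precedence-climbing sketch in Python that parses the ambiguous exp rule directly from an operator table. The operator table, token shapes, and tree representation are my own illustrative assumptions (not Lua's actual precedence table), and unary operators are omitted for brevity:

```python
# Precedence-climbing sketch for: exp ::= exp Binop exp | term
# PREC maps each operator to (precedence, associativity); the table is
# illustrative, not Lua's real one. Unop is omitted for brevity.
PREC = {'+': (1, 'left'), '-': (1, 'left'), '*': (2, 'left'), '^': (3, 'right')}

def parse_exp(tokens, pos, min_prec=0):
    # Parse a term (here: any single non-operator token), then greedily
    # absorb operators whose precedence is at least min_prec.
    lhs = tokens[pos]
    pos += 1
    while pos < len(tokens) and tokens[pos] in PREC:
        op = tokens[pos]
        prec, assoc = PREC[op]
        if prec < min_prec:
            break
        # A left-associative operator must not absorb another operator of
        # the same precedence on its right; a right-associative one may.
        next_min = prec + 1 if assoc == 'left' else prec
        rhs, pos = parse_exp(tokens, pos + 1, next_min)
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse_exp(['1', '+', '2', '*', '3'], 0)
# tree is ('+', '1', ('*', '2', '3')): '*' binds tighter than '+'
```

Note that the single ambiguous rule never appears in the code; the precedence table plays exactly the role of the separate precedence declarations accepted by parser generators.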

Another common example of an ambiguity in reference grammars that can be mechanically removed is the “dangling else” ambiguity found in languages such as C (but not in Lua). The grammar:

 if-statement ::= "if" '(' exp ')' stmt | "if" '(' exp ')' stmt "else" stmt 

is certainly ambiguous; the intention is that the parse be greedy (each else attaches to the nearest unmatched if). Again, the ambiguity is not inherent: there is a mechanical transformation that yields an unambiguous grammar, something like the following:

 matched-statement ::= matched-if-stmt | other-statement
 statement ::= matched-statement | unmatched-if-stmt
 matched-if-stmt ::= "if" '(' exp ')' matched-statement "else" matched-statement
 unmatched-if-stmt ::= "if" '(' exp ')' statement
                   |   "if" '(' exp ')' matched-statement "else" unmatched-if-stmt 

Parser generators often perform this conversion implicitly. (For an LR parser generator, the conversion is actually implemented by removing reduce actions when they conflict with a shift action. This is simpler than the grammar transformation, but it has exactly the same effect.)

Thus, Lua (and other programming languages) are not inherently ambiguous, and therefore they can be parsed with parsing algorithms that require unambiguous, deterministic grammars. Indeed, it may even be a little surprising that there are languages for which every possible grammar is ambiguous. As stated in the Wikipedia article cited above, the existence of such languages was proved by Rohit Parikh in 1961; a simple example of an inherently ambiguous context-free language is

{a^n b^m c^m d^n | n,m ≥ 0} ∪ {a^n b^n c^m d^m | n,m ≥ 0}.

3. Greedy LL(1) parsing of Lua assignment statements and function calls

As with the dangling-else construction above, the ambiguity in sequences of Lua statements can be removed by mechanically encoding the greedy parse. Intuitively, the procedure is simple: it amounts to forbidding two consecutive statements (without an intervening semicolon) where the second one begins with a token that could continue the first one.

In practice, there is no need to perform this transformation explicitly; it can be done implicitly during parser construction. So I am not going to produce a complete Lua grammar here; I believe the small subset of the Lua grammar below is enough to illustrate how the transformation can work.

The following subset (largely based on the reference grammar) exhibits exactly the ambiguity mentioned in the question:

 program ::= statement-list
 statement-list ::= Ø | statement-list statement
 statement ::= assignment | function-call | block | ';'
 block ::= "do" statement-list "end"
 assignment ::= var '=' exp
 exp ::= prefixexp [Note 3]
 prefixexp ::= var | '(' exp ')' | function-call
 var ::= Name | prefixexp '[' exp ']'
 function-call ::= prefixexp '(' exp ')' 

(Note: I use Ø to represent the empty string, rather than ε, λ or %empty.)

The Lua grammar is left-recursive, so it is clearly not LL(k) (independently of the ambiguity). Left recursion can be removed mechanically; I have done enough of that here to demonstrate that the subset is LL(1). Unfortunately, the transformed grammar does not preserve the structure of the parse tree, which is a classic problem with LL(k) grammars. It is usually easy to reconstruct the correct parse tree during a recursive descent parse, though, and I will not go into the details.

It is easy enough to produce an LL(1) version of exp, but the result eliminates the distinction between var (which can be assigned to) and function-call (which cannot):

 exp ::= term exp-postfix
 exp-postfix ::= Ø | '[' exp ']' exp-postfix | '(' exp ')' exp-postfix
 term ::= Name | '(' exp ')' 
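This transformed grammar transcribes directly into recursive descent procedures. A minimal Python sketch (token and tree shapes are my own assumptions; the right-recursive exp-postfix becomes a loop, which is a standard equivalence):

```python
# Recursive descent for:
#   exp ::= term exp-postfix
#   exp-postfix ::= Ø | '[' exp ']' exp-postfix | '(' exp ')' exp-postfix
#   term ::= Name | '(' exp ')'
# Tokens are single characters; letters stand in for Name. Illustrative only.
def parse_exp(tokens, pos):
    tree, pos = parse_term(tokens, pos)
    return parse_exp_postfix(tokens, pos, tree)

def parse_term(tokens, pos):
    if tokens[pos] == '(':                        # term ::= '(' exp ')'
        tree, pos = parse_exp(tokens, pos + 1)
        assert tokens[pos] == ')', 'expected )'
        return tree, pos + 1
    return ('name', tokens[pos]), pos + 1         # term ::= Name

def parse_exp_postfix(tokens, pos, tree):
    # The right-recursive exp-postfix rule, written as a loop.
    while pos < len(tokens) and tokens[pos] in ('[', '('):
        kind = 'index' if tokens[pos] == '[' else 'call'
        close = ']' if tokens[pos] == '[' else ')'
        inner, pos = parse_exp(tokens, pos + 1)
        assert tokens[pos] == close, 'expected ' + close
        tree, pos = (kind, tree, inner), pos + 1
    return tree, pos

# f(g)(h)[i] parses greedily as (((f call g) call h) index i)
tree, _ = parse_exp(list('f(g)(h)[i]'), 0)
```

Note that, as the surrounding text says, the procedure has no idea whether the result is a var or a function-call; that distinction has been erased by the transformation.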

But now we need to recreate the distinction in order to be able to parse assignment statements and function-call statements. That is straightforward (but not conducive to understanding the syntax, IMHO):

 a-or-fc-statement ::= term a-postfix
 a-postfix ::= '=' exp | ac-postfix
 c-postfix ::= Ø | ac-postfix
 ac-postfix ::= '(' exp ')' c-postfix | '[' exp ']' a-postfix 

To make the greedy parse unambiguous, we need to forbid (in the grammar) any appearance of S₁ S₂ where S₁ ends with an exp and S₂ begins with '('. In effect, we need to distinguish different kinds of statement, according to whether the statement starts with '(' and, independently, whether the statement ends with an exp. (In practice, there are only three kinds, because there are no statements that start with '(' and do not end with an exp. [Note 4])

 statement-list ::= Ø | s1 statement-list | s2 s2-postfix | s3 s2-postfix
 s2-postfix ::= Ø | s1 statement-list | s2 s2-postfix
 s1 ::= block | ';'
 s2 ::= Name a-postfix
 s3 ::= '(' exp ')' a-postfix 
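To see how the greedy parse plays out on the question's example, here is a toy Python parser over roughly this statement fragment. It is entirely my own illustration (not Lua's actual parser, which is hand-written in C); tokens are single characters, with letters and digits standing in for Name tokens:

```python
# Toy greedy parser for the assignment/function-call fragment.
# parse_prefixexp returns (assignable, pos): assignable is True when the
# result is a var (a bare Name, or a postfix chain ending in '[' exp ']').
def parse_prefixexp(tokens, pos):
    if tokens[pos] == '(':
        _, pos = parse_prefixexp(tokens, pos + 1)
        assert tokens[pos] == ')', 'expected )'
        pos, assignable = pos + 1, False
    else:
        pos, assignable = pos + 1, True           # Name
    while pos < len(tokens) and tokens[pos] in '([':
        close = ')' if tokens[pos] == '(' else ']'
        assignable = tokens[pos] == '['           # index yields a var; call does not
        _, pos = parse_prefixexp(tokens, pos + 1)
        assert tokens[pos] == close, 'expected ' + close
        pos += 1
    return assignable, pos

def count_statements(tokens):
    pos, count = 0, 0
    while pos < len(tokens):
        if tokens[pos] == ';':
            pos += 1
            continue
        if not (tokens[pos] == '(' or tokens[pos].isalnum()):
            raise SyntaxError('unexpected %r' % tokens[pos])
        assignable, pos = parse_prefixexp(tokens, pos)
        if pos < len(tokens) and tokens[pos] == '=':
            if not assignable:
                raise SyntaxError('cannot assign to a function call')
            _, pos = parse_prefixexp(tokens, pos + 1)
        count += 1
    return count
```

With an explicit semicolon, a=f(g);(h)[i]=1 parses as two statements, but a=f(g)(h)[i]=1 is rejected: the postfix chain is consumed greedily as one prefixexp, leaving the trailing = 1 with nothing to attach to, exactly as described in the question.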

4. What is recursive descent parsing, and how can it be modified to incorporate disambiguation?

In its most common usage, a predictive recursive descent parser is an implementation of the LL(k) algorithm in which each nonterminal is mapped onto a procedure. Each nonterminal procedure starts by using a table of possible lookahead sequences of length k to decide which alternative production to use for that nonterminal, and then simply “executes” the production symbol by symbol: a terminal symbol causes the next input token to be discarded if it matches, or an error to be reported if it does not; a nonterminal symbol causes the corresponding nonterminal procedure to be invoked.

The lookahead tables can be constructed from the FIRSTₖ and FOLLOWₖ sets. (A production A→ω is mapped to a lookahead sequence of terminals α if α ∈ FIRSTₖ(ω FOLLOWₖ(A)).) [Note 5]

Using this definition of recursive descent parsing, a recursive descent parser can handle exactly and only the LL(k) languages. [Note 6]

However, equating LL(k) parsers with recursive descent parsers ignores an important aspect of a recursive descent parser, which is that it is, first and foremost, a program, usually written in some Turing-complete programming language. If that program is allowed to deviate slightly from the rigid structure described above, it can parse a much larger set of languages, even languages that are not context-free. (See, for example, the C context-sensitivity mentioned in Note 2.)

In particular, it is very easy to add a “default” rule to the lookahead tables. This is a very tempting optimization, since it considerably reduces the size of the lookahead table. Typically, the default rule is used for nonterminals whose alternatives include an empty right-hand side, which in the case of an LL(1) grammar would be mapped to every symbol in the FOLLOW set of the nonterminal. In that implementation, the lookahead table includes only the lookaheads from the FIRST sets, and the parser automatically produces the empty right-hand side, corresponding to an immediate return, for any other symbol. (As with the similar optimization in LR(k) parsers, this can delay the recognition of errors, but they are still recognized before an additional token is read.)
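The default rule can be seen in a procedure for a nullable nonterminal: the dispatch checks the lookahead against the FIRST set only, and treats every other token as selecting the empty alternative. A sketch in Python, using statement-list ::= Ø | statement statement-list; the token sets and the one-token parse_statement are illustrative assumptions:

```python
# "Default rule" sketch for a nullable nonterminal. Instead of consulting
# FOLLOW(statement-list) to choose the empty production, the procedure
# checks the lookahead against FIRST(statement) only and returns
# immediately on anything else. Token set is illustrative.
FIRST_STATEMENT = {'do', ';', 'Name', '('}

def parse_statement(tokens, pos):
    # Stand-in: each statement is a single token here.
    return tokens[pos], pos + 1

def parse_statement_list(tokens, pos):
    stmts = []
    while pos < len(tokens) and tokens[pos] in FIRST_STATEMENT:
        stmt, pos = parse_statement(tokens, pos)
        stmts.append(stmt)
    # Default rule: any other lookahead ('end', end of input, or even an
    # erroneous token) selects the empty production and returns at once.
    return stmts, pos

result, pos = parse_statement_list(['Name', ';', 'Name', 'end'], 0)
# result is ['Name', ';', 'Name']; 'end' selected the empty production
```

Note that the procedure never inspects FOLLOW(statement-list) at all, which is precisely why a FIRST/FOLLOW conflict on a nullable nonterminal would go unnoticed here, as the next paragraph describes.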

An LL(1) parser cannot include a nullable nonterminal whose FIRST and FOLLOW sets contain a common element. However, if the recursive descent parser uses the default-rule optimization, that conflict will never be noticed during the construction of the parser. In effect, ignoring the conflict allows a “greedy” parser to be built from (certain) non-deterministic grammars.

This is extremely convenient, because, as we saw above, producing unambiguous greedy grammars is a lot of work and does not lead to anything even vaguely resembling a clear exposition of the language. But the modified recursive descent algorithm is not more powerful: it simply parses an equivalent SLL(k) grammar (without actually constructing that grammar).

I am not going to give a complete proof of the above claim, but the first step is to observe that any nonterminal can be rewritten as a disjunction of new nonterminals, each with a single distinct FIRST token, plus possibly a new nonterminal with an empty right-hand side. It is then “only” necessary to remove the conflicting terminals from the FOLLOW sets of the nullable nonterminals by creating new productions.


Notes

  • Here I am talking about a grammar that operates on a token stream from which comments have been removed, and in which other constructs (for example, strings delimited by “long brackets”) have been reduced to single tokens. Without this transformation, the language would not be LL(k) (since comments, which can be arbitrarily long, interfere with seeing the lookahead token). It also lets me sidestep the question of whether long brackets can be recognized with an LL(k) grammar, which is not particularly relevant here.

  • There are programming languages that cannot be deterministically parsed with a context-free grammar. Perl is probably the best-known example, but there is also the well-known C construct (x)*y, which can only be parsed using information about the symbol x (whether it is a variable name or a type alias), and the difficulty of correctly parsing C++ expressions involving templates. (See, for example, the questions Why can't C++ be parsed with an LR(1) parser? and Is C++ context-free or context-sensitive?)

  • For simplicity, I have removed the various literal constants (strings, numbers, booleans, etc.), as well as table constructors and function definitions. These tokens cannot be the target of a function call, which means that an expression ending with one of them cannot be extended with a parenthesized expression. Removing them simplifies the illustration of the disambiguation; the procedure is still possible with the full grammar, but it is even more tedious.

  • With the full grammar, we would also need to deal with expressions that cannot be extended with '(', so there would be four different kinds of statement.

  • There are deterministic LL(k) grammars that do not yield unambiguous parsing tables under this algorithm, which Sippu and Soisalon-Soininen call the Strong LL(k) algorithm. The algorithm can be augmented with an additional parsing state, similar to the state in an LR(k) parser. This may be convenient for particular grammars, but it does not change the definition of the LL(k) languages: as Sippu and Soisalon-Soininen show, an SLL(k) grammar producing exactly the same language can be mechanically derived from any LL(k) grammar. (See Theorem 8.47 in Volume 2.)

  • The recursive descent algorithm is an exact implementation of the canonical stack-based LL(k) parser, in which the parser stack is implicitly realized at run time through the combination of the current continuation and the stack of activation records.

