Most computer languages are "technically" not LL, because they are not even context-free. That includes languages which require identifiers to be declared (C, C++, C#, Java, etc.), languages with preprocessors and/or macro facilities (C and C++, among others), and languages with ambiguities that can only be resolved using semantic information (Perl is the worst offender here, but C and C++ are in there too). And, to spread the joy around a bit more, it also includes languages with Python-like layout awareness (Python, of course, and also Haskell).
I put quotes around "technically" because many people will say that these exceptions "don't count". That is their opinion, and they are entitled to it (and in any case I've given up arguing about it, although I don't share that view). You could, for example, eliminate the preprocessor from C/C++ by preprocessing the text before applying the grammar, or handle whitespace-sensitive languages by pre-tokenizing the input and inserting INDENT, NEWLINE and DEDENT tokens in place of the significant whitespace, after which you could make some claim about a mystical "core language". (That is much harder to apply to C++ templates, which can only be eliminated by parsing the text, so you cannot claim that the expansion can be done prior to parsing.)
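To make the layout trick concrete, here is roughly what such a pre-tokenizer looks like. This is a toy sketch in Python of my own devising, not any real implementation; it ignores tabs, comments and bracketed continuation lines:

    def layout_tokens(source):
        # Sketch of layout pre-tokenization: leading whitespace becomes
        # explicit INDENT/DEDENT tokens and each logical line ends with
        # NEWLINE, so a context-free grammar can treat layout as
        # ordinary terminals. (Toy code: real tokenizers also handle
        # tabs, comments, and continuation lines inside brackets.)
        indent_stack = [0]
        for line in source.splitlines():
            body = line.lstrip(' ')
            if not body:
                continue                      # blank line: no layout effect
            width = len(line) - len(body)
            if width > indent_stack[-1]:
                indent_stack.append(width)
                yield ('INDENT', width)
            while width < indent_stack[-1]:
                indent_stack.pop()
                yield ('DEDENT', width)
            yield ('TEXT', body)              # the line's real tokens go here
            yield ('NEWLINE', '')
        while len(indent_stack) > 1:          # close blocks still open at EOF
            indent_stack.pop()
            yield ('DEDENT', 0)

Once INDENT, DEDENT and NEWLINE are ordinary tokens in the stream, a block construct can be described context-freely by something like suite => NEWLINE INDENT statement+ DEDENT.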
The claim that a language is not context-free because it requires identifiers to be declared is perhaps a bit more controversial. In some languages (not C/C++, in which semantic analysis is required to avoid ambiguity), you could argue that the parse tree can be constructed without validating the declaration rule, which makes that rule "non-syntactic". But it remains the case that the declaration rule can be enforced syntactically; you just cannot do it with a context-free grammar (you could use a Van Wijngaarden grammar, for example).
In any case, a common parsing strategy is to recognize a superset of the language and then reject non-conforming programs through subsequent analysis of the resulting parse tree; in that case, the question becomes whether the superset is LL. In my opinion, for this to be meaningful it is necessary that every valid program actually be parsed into the correct syntax tree, which rules out using semantic analysis to disambiguate.
The goal of parsing is to produce a syntax tree, not merely to recognize whether a text is valid or not. (You might miss this important fact if you read formal language textbooks, which tend to focus on recognition, possibly because constructing syntax trees is less interesting.) So it seems reasonable to insist that the proposed grammar actually represent the syntactic structure of the language.
For example, you can recognize algebraic expressions with the following simple regular language:
(PREFIX* OPERAND POSTFIX*) (INFIX PREFIX* OPERAND POSTFIX*)*
where PREFIX is the set of prefix operators as well as (, POSTFIX is the set of postfix operators (if any) as well as ), INFIX is the set of infix operators (addition, subtraction, multiplication, etc.), and OPERAND is an identifier or a constant.
This regular expression will not correctly reject expressions with unbalanced parentheses, but we already agreed that it was fine to recognize a superset of the language, right? :-)
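If you want to play with it, here is a toy instantiation in Python; the particular token classes (one-character operands, ! as a postfix operator, and so on) are mine, purely for illustration:

    import re

    # A toy instantiation of the regular superset above: one-character
    # operands, a few unary and binary operators.
    PREFIX  = r'[-!(]'
    POSTFIX = r'[!)]'
    INFIX   = r'[-+*/]'
    OPERAND = r'[a-z0-9]'

    EXPR = re.compile(
        f'{PREFIX}*{OPERAND}{POSTFIX}*'
        f'(?:{INFIX}{PREFIX}*{OPERAND}{POSTFIX}*)*$'
    )

    print(bool(EXPR.match('a+b*c')))    # True
    print(bool(EXPR.match('(a+b)*c')))  # True
    print(bool(EXPR.match('(a+b*c')))   # True: unbalanced, but in the superset
    print(bool(EXPR.match('a++b')))     # False: '+' is not an operand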
If desired, we can remove the parentheses from the PREFIX and POSTFIX sets and make OPERAND recursive. The resulting grammar is trivially LL(1) [Note 2]:
operand    => IDENTIFIER | CONSTANT | '(' expression ')'
term       => operand | PREFIX operand | operand POSTFIX
expression => term | term INFIX expression
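To make that concrete, a recursive descent parser for this grammar is just a handful of mutually recursive functions. The following is a Python sketch; the (kind, text) token representation and the tuple trees are my own choices, and the common prefixes of term and expression are left-factored, as any implementation would do:

    # Recursive descent sketch for the LL(1) grammar above. Tokens are
    # (kind, text) pairs with kind one of: IDENTIFIER, CONSTANT,
    # PREFIX, POSTFIX, INFIX, '(' or ')'. Trees are nested tuples.
    class Parser:
        def __init__(self, tokens):
            self.tokens = list(tokens) + [('EOF', '')]
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos][0]

        def take(self, kind):
            k, text = self.tokens[self.pos]
            if k != kind:
                raise SyntaxError(f'expected {kind}, got {k}')
            self.pos += 1
            return text

        def operand(self):
            # operand => IDENTIFIER | CONSTANT | '(' expression ')'
            if self.peek() == '(':
                self.take('(')
                tree = self.expression()
                self.take(')')
                return tree
            if self.peek() == 'CONSTANT':
                return self.take('CONSTANT')
            return self.take('IDENTIFIER')   # errors out on anything else

        def term(self):
            # term => operand | PREFIX operand | operand POSTFIX
            if self.peek() == 'PREFIX':
                return (self.take('PREFIX'), self.operand())
            tree = self.operand()
            if self.peek() == 'POSTFIX':
                return (self.take('POSTFIX'), tree)
            return tree

        def expression(self):
            # expression => term | term INFIX expression
            left = self.term()
            if self.peek() == 'INFIX':
                return (self.take('INFIX'), left, self.expression())
            return left

Running it on a-b-c already exposes the problem discussed next: the right-recursive expression production yields the right-associated tree.

    tokens = [('IDENTIFIER', 'a'), ('INFIX', '-'), ('IDENTIFIER', 'b'),
              ('INFIX', '-'), ('IDENTIFIER', 'c')]
    print(Parser(tokens).expression())   # ('-', 'a', ('-', 'b', 'c'))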
The problem, however, is that this grammar does not capture operator precedence. It does not even attempt to recognize the fact that a+b*c and a*b+c are both sums, so that the top-level operator is + in both cases, and not simply whichever operator happens to come first in the expression. (If you were parsing APL, this would not be a problem. But most languages conform to the usual expectations about operator precedence.)
This point is important, because an LL grammar cannot handle left-recursive productions, and you need a left-recursive production to correctly parse a left-associative operator. (That is, to correctly parse a-b-c as ((a-b)-c) rather than (a-(b-c)), which would have a different value.) Again, you might argue that this is a quibble, since you could post-process the parse tree to correct the associativity. That is certainly true, but the result is that the grammar you use to parse is different from the grammar you use to explain the syntax of the language. That might or might not bother you.
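For the record, the standard LL workaround replaces the left recursion with a loop and folds the tree leftward by hand; as a drop-in replacement for the expression method in the sketch above:

    def expression(self):
        # expression => term (INFIX term)*  (iteration instead of left
        # recursion; the loop folds the tree leftward: ((a-b)-c))
        tree = self.term()
        while self.peek() == 'INFIX':
            op = self.take('INFIX')
            tree = (op, tree, self.term())
        return tree

Note that the grammar this loop actually implements, expression => term (INFIX term)*, is not the one you would print in a reference manual, which is exactly the quibble.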
It is worth adding here that there are languages (Haskell and Prolog, and probably many others) in which you can declare operators and their precedences in the program text. That obviously makes it impossible to build the correct syntax tree without semantic analysis, and the usual approach to parsing such languages is to use precisely the simplified "alternating operands and operators" language described above. Once the operator precedences are all known, presumably at the end of the parse, all expressions are then reanalyzed using something like the shunting-yard algorithm in order to produce the correct parse.
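A minimal version of that reanalysis, again as a Python sketch; the PREC table stands in for whatever precedences the program text declared, and every operator here is treated as binary and left-associative:

    # Shunting-yard reanalysis: input is the flat list of operands and
    # infix operators produced by the simplified grammar; PREC is an
    # illustrative precedence table.
    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

    def reparse(items):
        out, ops = [], []                  # operand stack, operator stack
        def reduce_one():
            op = ops.pop()
            right, left = out.pop(), out.pop()
            out.append((op, left, right))
        for it in items:
            if it in PREC:                 # an infix operator
                while ops and PREC[ops[-1]] >= PREC[it]:   # left-associative
                    reduce_one()
                ops.append(it)
            else:                          # an operand (or sub-tree)
                out.append(it)
        while ops:
            reduce_one()
        return out[0]

    print(reparse(['a', '+', 'b', '*', 'c']))  # ('+', 'a', ('*', 'b', 'c'))
    print(reparse(['a', '*', 'b', '+', 'c']))  # ('+', ('*', 'a', 'b'), 'c')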
Suppose we dismiss the above objections and simply ask, "for which commonly used programming languages can an LL parser be used?"
To avoid cheating, though, we should really require that the LL parser have constant lookahead and not require backtracking. If you allow arbitrary lookahead and backtracking, you considerably expand the domain of parseable languages, but I contend that you are no longer in the realm of LL. That eliminates, for example, the template- and preprocessor-free subset of C++, even though common compiler implementations use recursive descent parsers, since backtracking is required to resolve the "Most Vexing Parse" ambiguity. (In any case, you cannot really remove templates from C++, and parsing with templates is very hard. [Note 3])
Java and Python were both designed to be LL(1) "pseudo-parseable". I believe Haskell also falls into this category. C cannot be correctly parsed without semantic information (the classic ambiguity: is (x)*(y) a cast or a multiplication? It depends on whether x has been typedef'ed or declared as a variable), but I am fairly sure it can be captured with a non-backtracking recursive descent parser. I have not looked at C# in depth, but Eric Lippert's answer suggests that it requires backtracking in some cases.