I would use
Dictionary<NonTerminalSymbol, Set<List<Symbol>>>
allowing you to look up, for a given nonterminal, the set of right-hand sides of its production rules (each right-hand side represented as a list of terminal/nonterminal symbols). (The OP's question shows that the nonterminal E can be associated with two rules, but given the left-hand side we only need the right-hand sides.)
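To make this concrete, here is a minimal sketch of that dictionary encoding in Java (the C# Dictionary/HashSet version is nearly identical; the names are mine, not part of any fixed API), storing a couple of rules from the expression grammar discussed below:

```java
import java.util.*;

// Sketch of the Dictionary<NonTerminal, Set<List<Symbol>>> idea,
// using plain Strings for symbols (illustrative names, not a fixed API).
public class GrammarTable {
    public static void main(String[] args) {
        Map<String, Set<List<String>>> rules = new HashMap<>();

        // S = P | S + P | S - P ;  -- three right-hand sides for S
        rules.put("S", new HashSet<>(List.of(
            List.of("P"),
            List.of("S", "+", "P"),
            List.of("S", "-", "P"))));

        // E = S ;  -- a single right-hand side
        rules.put("E", Set.of(List.of("S")));

        // Given only the left-hand side, we get back all its right-hand sides:
        System.out.println(rules.get("S").size()); // prints 3
    }
}
```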
This representation only works for plain, vanilla BNF grammars, which have no syntactic sugar for common grammar-defining idioms. Such idioms usually include alternation, Kleene star/plus, ..., and when they are available the grammar is what is called Extended BNF, or EBNF. If we use EBNF only for alternation, written with the symbol |, then the expression grammar the OP gave as an example, in flat form:
E = S ; S = P | S + P | S - P ; P = T | P * T | P / T ; T = T ** M | ( E ) | Number | ID ;
can be represented by my first suggestion, because alternation is used only to list the different right-hand sides. However, it cannot represent this:
E = S ; S = P A* ; A = + P | - P ; P = T M+ ; -- to be different M = * T | / T ; T = T ** M | ( E ) | Number | ID | ID ( E ) ;
The key issue this additional notation introduces is the ability to compose the EBNF operators over sub-syntax definitions; that composability is the whole point of EBNF.
To represent EBNF rules, you essentially have to store the right-hand sides as trees that capture the structure of the EBNF expressions (in fact, this is essentially the same problem as representing any expression grammar).
To represent an EBNF tree (expression), you need to define a tree structure for EBNF. You need tree nodes for:
- symbols (terminal or not)
- alternation (carrying a list of alternatives)
- Kleene star
- Kleene plus
- "optional"?
- whatever else you decide your EBNF has as operators (for example, comma'd lists: a way of saying one has a list of grammar elements separated by a chosen "comma" token, or terminated by a chosen "semicolon" token, ...)
The easiest way to do this is to first write an EBNF grammar for EBNF itself:
EBNF = RULE+ ;
RULE = LHS "=" TERM* ";" ;
TERM = STRING | SYMBOL | TERM "*" | TERM "+" | ";" STRING TERM | "," TERM STRING | "(" TERM* ")" ;
Notice that I added semicolon- and comma-separated lists to this EBNF (Extended, remember?).
Now we can just read off this EBNF to decide what is needed. You need a set of records (OK, classes for you C# folks) to represent each of these rules. So:
- a class for EBNF, containing a set of RULEs
- a class for RULE, holding an LHS symbol and a list of TERMs
- an abstract base class for TERM, with one concrete subclass per TERM alternative (a so-called "discriminated union", usually implemented in OO languages by inheritance and instance checks)
Note that some of the concrete subclasses may refer to other node classes; that is how you get a tree. For example:
KleeneStar inherits_from TERM { T: TERM; }
Details of encoding the rest are left to the reader.
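As one possible encoding (sketched in Java; the C# translation with records or classes is direct, and every name here is illustrative), the TERM discriminated union and the RULE/EBNF wrappers might look like this:

```java
// Sketch of the TERM class hierarchy as a discriminated union
// (all names are illustrative, not prescribed by the answer above).
import java.util.List;

sealed interface Term permits Sym, Str, Alt, Star, Plus, Opt {}

record Sym(String name)             implements Term {} // terminal or nonterminal symbol
record Str(String literal)          implements Term {} // quoted string / keyword
record Alt(List<Term> alternatives) implements Term {} // alternation: a | b | c
record Star(Term body)              implements Term {} // Kleene star: body*
record Plus(Term body)              implements Term {} // Kleene plus: body+
record Opt(Term body)               implements Term {} // optional: [ body ]

record Rule(String lhs, List<Term> rhs) {}            // LHS = TERM* ;
record Ebnf(List<Rule> rules) {}                      // a whole grammar

public class EbnfTreeDemo {
    public static void main(String[] args) {
        // P = T M+ ;  becomes a rule whose right-hand side is a list of terms:
        Rule p = new Rule("P", List.of(new Sym("T"), new Plus(new Sym("M"))));
        System.out.println(p.rhs().size()); // prints 2
    }
}
```

Note how Star, Plus, and Opt hold another Term, and Alt holds a list of them; that nesting is exactly what makes the representation a tree.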
This leaves an unaddressed problem for the OP: how do you use this grammar representation to drive the parsing of strings?
The simple answer is to get a parser generator, but then you need to figure out which EBNF it uses. (In that case, it would be easier to keep your EBNF as text and hand it to that parser generator, which makes this whole discussion moot.)
If you cannot get one, or want to build your own, well, now you have an idea of what you need to build it on top of. One alternative is to build a recursive-descent parser driven by this representation to do your parsing. How to do that is too big to cram into this answer, but it is straightforward for people with experience with recursion.
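To hint at how such an interpreter recurses over the tree, here is a toy matcher (a sketch under my own simplifying assumptions: literal tokens only, no nonterminal expansion, first-match alternation rather than full backtracking):

```java
import java.util.List;

// Toy sketch: match(term, tokens, pos) interprets a TERM tree against a
// token list, returning the position after a match, or -1 on failure.
public class TreeMatcher {
    sealed interface Term permits Lit, Seq, Alt, Star {}
    record Lit(String text)        implements Term {} // a single literal token
    record Seq(List<Term> items)   implements Term {} // items in order
    record Alt(List<Term> choices) implements Term {} // first alternative that matches wins
    record Star(Term body)         implements Term {} // body, zero or more times

    static int match(Term t, List<String> toks, int pos) {
        return switch (t) {
            case Lit l -> pos < toks.size() && toks.get(pos).equals(l.text()) ? pos + 1 : -1;
            case Seq s -> {
                int p = pos;
                for (Term item : s.items()) {
                    p = match(item, toks, p);
                    if (p < 0) break;              // one item failed: the whole sequence fails
                }
                yield p;
            }
            case Alt a -> {
                int r = -1;
                for (Term c : a.choices()) {
                    int p = match(c, toks, pos);   // note: no backtracking across choices
                    if (p >= 0) { r = p; break; }
                }
                yield r;
            }
            case Star k -> {
                int p = pos, q;
                while ((q = match(k.body(), toks, p)) >= 0 && q > p) p = q; // repeat while progressing
                yield p;                            // Star never fails; zero matches is fine
            }
        };
    }

    public static void main(String[] args) {
        // number ( "+" number )*  matched against:  1 + 2 + 3
        Term num  = new Alt(List.of(new Lit("1"), new Lit("2"), new Lit("3")));
        Term expr = new Seq(List.of(num, new Star(new Seq(List.of(new Lit("+"), num)))));
        System.out.println(match(expr, List.of("1", "+", "2", "+", "3"), 0)); // prints 5
    }
}
```

A real version would carry the grammar dictionary along so a symbol node can expand its nonterminal, and would need genuine backtracking (or a stronger engine) for grammars where first-match alternation is wrong.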
EDIT 10/22: OP clarifies that he insists on parsing arbitrary context-free grammars, and "especially NL". For arbitrary context-free grammars he will need a very strong parsing engine (Earley, GLR, full backtracking, ...). For natural language he will need parsers much stronger than that; people have been trying to build such parsers for decades with some, but definitely not easy, success. Either requirement seems to make the discussion of grammar representation pretty much moot: a straight context-free grammar will not parse natural language (as demonstrated by those decades of effort), and if he wants a stronger NL parser he will simply have to use whatever the bleeding-edge engines provide. Count me a pessimist about his likely success unless he decides to become a real expert in NL parsing first.