I would use
Dictionary<NonTerminalSymbol, Set<List<Symbol>>>
allowing you to look up, for a given nonterminal, the set of right-hand sides of its production rules (each right-hand side represented as a list of terminal/nonterminal symbols). (The OP's question shows that the nonterminal E can be associated with two rules, but given the left-hand side we only need the right-hand sides.)
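To make this concrete, here is a minimal sketch of that dictionary encoding in Java (the C# Dictionary/HashSet version is nearly identical; the names are mine, not part of any fixed API), storing a couple of rules from the expression grammar discussed below:

```java
import java.util.*;

// Sketch of the Dictionary<NonTerminal, Set<List<Symbol>>> idea,
// using plain Strings for symbols (illustrative names, not a fixed API).
public class GrammarTable {
    public static void main(String[] args) {
        Map<String, Set<List<String>>> rules = new HashMap<>();

        // S = P | S + P | S - P ;  -- three right-hand sides for S
        rules.put("S", new HashSet<>(List.of(
            List.of("P"),
            List.of("S", "+", "P"),
            List.of("S", "-", "P"))));

        // E = S ;  -- a single right-hand side
        rules.put("E", Set.of(List.of("S")));

        // Given only the left-hand side, we get back all its right-hand sides:
        System.out.println(rules.get("S").size()); // prints 3
    }
}
```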
This representation only works for plain, vanilla BNF grammars, which have no syntactic sugar for common grammar-defining idioms. Such idioms usually include alternation, Kleene star/plus, ..., and when they are available the grammar is what is called Extended BNF, or EBNF. If we use EBNF only for alternation, written with the symbol |, then the expression grammar the OP gave as an example, in flat form:
E = S ; S = P | S + P | S - P ; P = T | P * T | P / T ; T = T ** M | ( E ) | Number | ID ;
can be represented by my first suggestion, because alternation is used only to list the different right-hand sides. However, it cannot represent this:
E = S ; S = P A* ; A = + P | - P ; P = T M+ ; -- to be different M = * T | / T ; T = T ** M | ( E ) | Number | ID | ID ( E ) ;
The key issue this additional notation introduces is the ability to compose the EBNF operators over sub-syntax definitions; that composability is the whole point of EBNF.
To represent EBNF rules, you essentially have to store the right-hand sides as trees that capture the structure of the EBNF expressions (in fact, this is essentially the same problem as representing any expression grammar).
To represent an EBNF tree (expression), you need to define a tree structure for EBNF. You need tree nodes for:
- symbols (terminal or not)
- alternation (carrying a list of alternatives)
- Kleene star
- Kleene plus
- "optional"?
- whatever else you decide your EBNF has as operators (for example, comma'd lists: a way of saying one has a list of grammar elements separated by a chosen "comma" token, or terminated by a chosen "semicolon" token, ...)
The easiest way to do this is to first write an EBNF grammar for EBNF itself:
EBNF = RULE+ ;
RULE = LHS "=" TERM* ";" ;
TERM = STRING | SYMBOL | TERM "*" | TERM "+" | ";" STRING TERM | "," TERM STRING | "(" TERM* ")" ;
Notice that I added semicolon- and comma-separated lists to this EBNF (Extended, remember?).
Now we can just read off this EBNF to decide what is needed. You need a set of records (OK, classes for you C# folks) to represent each of these rules. So:
- a class for EBNF, containing a set of RULEs
- a class for RULE, holding an LHS symbol and a list of TERMs
- an abstract base class for TERM, with one concrete subclass per TERM alternative (a so-called "discriminated union", usually implemented in OO languages by inheritance and instance checks)
Note that some of the concrete subclasses may refer to other node classes; that is how you get a tree. For example:
KleeneStar inherits_from TERM { T: TERM; }
Details of encoding the rest are left to the reader.
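As one possible encoding (sketched in Java; the C# translation with records or classes is direct, and every name here is illustrative), the TERM discriminated union and the RULE/EBNF wrappers might look like this:

```java
// Sketch of the TERM class hierarchy as a discriminated union
// (all names are illustrative, not prescribed by the answer above).
import java.util.List;

sealed interface Term permits Sym, Str, Alt, Star, Plus, Opt {}

record Sym(String name)             implements Term {} // terminal or nonterminal symbol
record Str(String literal)          implements Term {} // quoted string / keyword
record Alt(List<Term> alternatives) implements Term {} // alternation: a | b | c
record Star(Term body)              implements Term {} // Kleene star: body*
record Plus(Term body)              implements Term {} // Kleene plus: body+
record Opt(Term body)               implements Term {} // optional: [ body ]

record Rule(String lhs, List<Term> rhs) {}            // LHS = TERM* ;
record Ebnf(List<Rule> rules) {}                      // a whole grammar

public class EbnfTreeDemo {
    public static void main(String[] args) {
        // P = T M+ ;  becomes a rule whose right-hand side is a list of terms:
        Rule p = new Rule("P", List.of(new Sym("T"), new Plus(new Sym("M"))));
        System.out.println(p.rhs().size()); // prints 2
    }
}
```

Note how Star, Plus, and Opt hold another Term, and Alt holds a list of them; that nesting is exactly what makes the representation a tree.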
This leaves an unaddressed problem for the OP: how do you use this grammar representation to drive the parsing of strings?
The simple answer is to get a parser generator, but then you need to figure out which EBNF it uses. (In that case, it would be easier to keep your EBNF as text and hand it to that parser generator, which makes this whole discussion moot.)
If you cannot get one, or want to build your own, well, now you have an idea of what you need to build it on top of. One alternative is to build a recursive-descent parser driven by this representation to do your parsing. How to do that is too big to cram into this answer, but it is straightforward for people with experience with recursion.
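To hint at how such an interpreter recurses over the tree, here is a toy matcher (a sketch under my own simplifying assumptions: literal tokens only, no nonterminal expansion, first-match alternation rather than full backtracking):

```java
import java.util.List;

// Toy sketch: match(term, tokens, pos) interprets a TERM tree against a
// token list, returning the position after a match, or -1 on failure.
public class TreeMatcher {
    sealed interface Term permits Lit, Seq, Alt, Star {}
    record Lit(String text)        implements Term {} // a single literal token
    record Seq(List<Term> items)   implements Term {} // items in order
    record Alt(List<Term> choices) implements Term {} // first alternative that matches wins
    record Star(Term body)         implements Term {} // body, zero or more times

    static int match(Term t, List<String> toks, int pos) {
        return switch (t) {
            case Lit l -> pos < toks.size() && toks.get(pos).equals(l.text()) ? pos + 1 : -1;
            case Seq s -> {
                int p = pos;
                for (Term item : s.items()) {
                    p = match(item, toks, p);
                    if (p < 0) break;              // one item failed: the whole sequence fails
                }
                yield p;
            }
            case Alt a -> {
                int r = -1;
                for (Term c : a.choices()) {
                    int p = match(c, toks, pos);   // note: no backtracking across choices
                    if (p >= 0) { r = p; break; }
                }
                yield r;
            }
            case Star k -> {
                int p = pos, q;
                while ((q = match(k.body(), toks, p)) >= 0 && q > p) p = q; // repeat while progressing
                yield p;                            // Star never fails; zero matches is fine
            }
        };
    }

    public static void main(String[] args) {
        // number ( "+" number )*  matched against:  1 + 2 + 3
        Term num  = new Alt(List.of(new Lit("1"), new Lit("2"), new Lit("3")));
        Term expr = new Seq(List.of(num, new Star(new Seq(List.of(new Lit("+"), num)))));
        System.out.println(match(expr, List.of("1", "+", "2", "+", "3"), 0)); // prints 5
    }
}
```

A real version would carry the grammar dictionary along so a symbol node can expand its nonterminal, and would need genuine backtracking (or a stronger engine) for grammars where first-match alternation is wrong.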
EDIT 10/22: OP clarifies that he insists on parsing arbitrary context-free grammars, and "especially NL". For arbitrary context-free grammars he will need a very strong parsing engine (Earley, GLR, full backtracking, ...). For natural language he will need parsers much stronger than that; people have been trying to build such parsers for decades with some, but definitely not easy, success. Either requirement seems to make the discussion of grammar representation pretty much moot: a straight context-free grammar will not parse natural language (as demonstrated by those decades of effort), and if he wants a stronger NL parser he will simply have to use whatever the bleeding-edge engines provide. Count me a pessimist about his likely success unless he decides to become a real expert in NL parsing first.