ANTLR grammar for reStructuredText (rule priorities)

First stream of questions

Hello to all,

This may be a follow-up on this subject: Antlr rule priorities

I am trying to write an ANTLR grammar for reStructuredText markup language .

The main problem I encountered is: "How to match any sequence of characters (plain text) without masking other grammar rules?"

Take an example with a paragraph with inline markup:

In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element *before* the insert point. The variable ``after_ptr`` points to the element *after* the insert. In other words, we are going to put our new element **in between** ``before_ptr`` and ``after_ptr``. 

I thought writing rules for inline markup text would be easy. So I wrote a simple grammar:

 grammar Rst; options { output=AST; language=Java; backtrack=true; //memoize=true; } @members { boolean inInlineMarkup = false; } // PARSER text : inline_markup (WS? inline_markup)* WS? EOF ; inline_markup @after { inInlineMarkup = false; } : {!inInlineMarkup}? (emphasis|strong|litteral|link) ; emphasis @init { inInlineMarkup = true; } : '*' (~'*')+ '*' {System.out.println("emphasis: " + $text);} ; strong @init { inInlineMarkup = true; } : '**' (~'*')+ '**' {System.out.println("bold: " + $text);} ; litteral @init { inInlineMarkup = true; } : '``' (~'`')+ '``' {System.out.println("litteral: " + $text);} ; link @init { inInlineMarkup = true; } : inline_internal_target | footnote_reference | hyperlink_reference ; inline_internal_target : '_`' (~'`')+ '`' {System.out.println("inline_internal_target: " + $text);} ; footnote_reference : '[' (~']')+ ']_' {System.out.println("footnote_reference: " + $text);} ; hyperlink_reference : ~(' '|'\t'|'\u000C'|'_')+ '_' {System.out.println("hyperlink_reference: " + $text);} | '`' (~'`')+ '`_' {System.out.println("hyperlink_reference (long): " + $text);} ; // LEXER WS : (' '|'\t'|'\u000C')+ ; NEWLINE : '\r'? '\n' ; 

This simple grammar does not work. And I didn’t even try to match plain text ...

My questions:

  • Can someone point out my mistakes and maybe give me a hint on how to match plain text?
  • Is there a way to set priority in grammar rules? Perhaps this may be an occasion.

Thank you in advance for your help :-)

Robin


Second stream of questions

Many thanks for your help! It would be hard for me to understand my mistakes ... I am not writing this grammar (only) to find out ANTLR, I am trying to code the IDE plugin for eclipse. And for this I need a grammar;)

I managed to go further in grammar and wrote a text rule:

 grammar Rst; options { output=AST; language=Java; } @members { boolean inInlineMarkup = false; } ////////////////// // PARSER RULES // ////////////////// file : line* EOF ; line : text* NEWLINE ; text : inline_markup | normal_text ; inline_markup @after { inInlineMarkup = false; } : {!inInlineMarkup}? {inInlineMarkup = true;} ( | STRONG | EMPHASIS | LITTERAL | INTERPRETED_TEXT | SUBSTITUTION_REFERENCE | link ) ; link : INLINE_INTERNAL_TARGET | FOOTNOTE_REFERENCE | HYPERLINK_REFERENCE ; normal_text : {!inInlineMarkup}? ~(EMPHASIS |SUBSTITUTION_REFERENCE |STRONG |LITTERAL |INTERPRETED_TEXT |INLINE_INTERNAL_TARGET |FOOTNOTE_REFERENCE |HYPERLINK_REFERENCE |NEWLINE ) ; ////////////////// // LEXER TOKENS // ////////////////// EMPHASIS : STAR ANY_BUT_STAR+ STAR {System.out.println("EMPHASIS: " + $text);} ; SUBSTITUTION_REFERENCE : PIPE ANY_BUT_PIPE+ PIPE {System.out.println("SUBST_REF: " + $text);} ; STRONG : STAR STAR ANY_BUT_STAR+ STAR STAR {System.out.println("STRONG: " + $text);} ; LITTERAL : BACKTICK BACKTICK ANY_BUT_BACKTICK+ BACKTICK BACKTICK {System.out.println("LITTERAL: " + $text);} ; INTERPRETED_TEXT : BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("LITTERAL: " + $text);} ; INLINE_INTERNAL_TARGET : UNDERSCORE BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("INLINE_INTERNAL_TARGET: " + $text);} ; FOOTNOTE_REFERENCE : L_BRACKET ANY_BUT_BRACKET+ R_BRACKET UNDERSCORE {System.out.println("FOOTNOTE_REFERENCE: " + $text);} ; HYPERLINK_REFERENCE : BACKTICK ANY_BUT_BACKTICK+ BACKTICK UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (long): " + $text);} | ANY_BUT_ENDLINK+ UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (short): " + $text);} ; WS : (' '|'\t')+ {$channel=HIDDEN;} ; NEWLINE : '\r'? '\n' {$channel=HIDDEN;} ; /////////////// // FRAGMENTS // /////////////// fragment ANY_BUT_PIPE : ESC PIPE | ~(PIPE|'\n'|'\r') ; fragment ANY_BUT_BRACKET : ESC R_BRACKET | ~(R_BRACKET|'\n'|'\r') ; fragment ANY_BUT_STAR : ESC STAR | ~(STAR|'\n'|'\r') ; fragment ANY_BUT_BACKTICK : ESC BACKTICK | ~(BACKTICK|'\n'|'\r') ; fragment ANY_BUT_ENDLINK : ~(UNDERSCORE|' '|'\t'|'\n'|'\r') ; fragment ESC : '\\' ; fragment STAR : '*' ; fragment BACKTICK : '`' ; fragment PIPE : '|' ; fragment L_BRACKET : '[' ; fragment R_BRACKET : ']' ; fragment UNDERSCORE : '_' ; 

Grammar works fine for inline_markup, but normal_text does not map.

Here is my test class:

 import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; import java.io.Reader; import org.antlr.runtime.ANTLRStringStream; import org.antlr.runtime.CommonTokenStream; import org.antlr.runtime.RecognitionException; import org.antlr.runtime.tree.Tree; public class Test { public static void main(String[] args) throws RecognitionException, IOException { InputStream is = Test.class.getResourceAsStream("test.rst"); Reader r = new InputStreamReader(is); StringBuilder source = new StringBuilder(); char[] buffer = new char[1024]; int readLenght = 0; while ((readLenght = r.read(buffer)) > 0) { if (readLenght < buffer.length) { source.append(buffer, 0, readLenght); } else { source.append(buffer); } } r.close(); System.out.println(source.toString()); ANTLRStringStream in = new ANTLRStringStream(source.toString()); RstLexer lexer = new RstLexer(in); CommonTokenStream tokens = new CommonTokenStream(lexer); RstParser parser = new RstParser(tokens); RstParser.file_return out = parser.file(); System.out.println(((Tree)out.getTree()).toStringTree()); } } 

And the input file that I use:

 In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element *before* the insert point. The variable ``after_ptr`` points to the |element| *after* the insert. In other words, `we are going`_ to put_ our new element **in between** ``before_ptr`` and ``after_ptr``. 

And I get this output:

 HYPERLINK_REFERENCE (short): 7-6`_ line 1:2 mismatched character ' ' expecting '_' line 1:10 mismatched character ' ' expecting '_' line 1:18 mismatched character ' ' expecting '_' line 1:21 mismatched character ' ' expecting '_' line 1:26 mismatched character ' ' expecting '_' line 1:37 mismatched character ' ' expecting '_' LITTERAL: `before_ptr` line 1:86 no viable alternative at character '\r' line 1:55 mismatched character ' ' expecting '_' line 1:60 mismatched character ' ' expecting '_' line 1:63 mismatched character ' ' expecting '_' line 1:70 mismatched character ' ' expecting '_' line 1:73 mismatched character ' ' expecting '_' line 1:77 mismatched character ' ' expecting '_' line 1:85 mismatched character ' ' expecting '_' EMPHASIS: *before* line 2:12 mismatched character ' ' expecting '_' line 2:19 mismatched character ' ' expecting '_' line 2:26 mismatched character ' ' expecting '_' LITTERAL: `after_ptr` line 2:30 mismatched character ' ' expecting '_' line 2:39 mismatched character ' ' expecting '_' line 2:90 no viable alternative at character '\r' line 2:60 mismatched character ' ' expecting '_' line 2:63 mismatched character ' ' expecting '_' line 2:67 mismatched character ' ' expecting '_' line 2:77 mismatched character ' ' expecting '_' line 2:85 mismatched character ' ' expecting '_' line 2:89 mismatched character ' ' expecting '_' line 3:7 mismatched character ' ' expecting '_' line 3:10 mismatched character ' ' expecting '_' line 3:16 mismatched character ' ' expecting '_' line 3:23 mismatched character ' ' expecting '_' line 3:27 mismatched character ' ' expecting '_' line 3:31 mismatched character ' ' expecting '_' line 3:42 mismatched character ' ' expecting '_' line 3:51 mismatched character ' ' expecting '_' line 3:55 mismatched character ' ' expecting '_' line 3:63 mismatched character ' ' expecting '_' line 3:94 mismatched character '\r' expecting '*' line 4:3 mismatched character ' ' expecting '_' line 4:18 no viable alternative at character '\r' line 4:18 mismatched character '\r' expecting '_' HYPERLINK_REFERENCE (short): oing`_ HYPERLINK_REFERENCE (short): ut_ EMPHASIS: *in between* LITTERAL: `after_ptr` BR.recoverFromMismatchedToken line 0:-1 mismatched input '<EOF>' expecting NEWLINE null 

Can you point out my mistakes? (the parser works for built-in markup without errors when I add the filter = true parameter; option for grammar)

Robin

+4
source share
2 answers

Here is a quick demonstration of how you could parse this reStructeredText. Note that it only handles an insignificant set of all available markup syntax, and adding more to it will affect the existing parser / lexer rules: so much more needs to be done.

Demo

 grammar RST; options { output=AST; backtrack=true; memoize=true; } tokens { ROOT; PARAGRAPH; INDENTATION; LINE; WORD; BOLD; ITALIC; INTERPRETED_TEXT; INLINE_LITERAL; REFERENCE; } parse : paragraph+ EOF -> ^(ROOT paragraph+) ; paragraph : line+ -> ^(PARAGRAPH line+) | Space* LineBreak -> /* omit line-breaks between paragraphs from AST */ ; line : indentation text+ LineBreak -> ^(LINE text+) ; indentation : Space* -> ^(INDENTATION Space*) ; text : styledText | interpretedText | inlineLiteral | reference | Space | Star | EscapeSequence | Any ; styledText : bold | italic ; bold : Star Star boldAtom+ Star Star -> ^(BOLD boldAtom+) ; italic : Star italicAtom+ Star -> ^(ITALIC italicAtom+) ; boldAtom : ~(Star | LineBreak) | italic ; italicAtom : ~(Star | LineBreak) | bold ; interpretedText : BackTick interpretedTextAtoms BackTick -> ^(INTERPRETED_TEXT interpretedTextAtoms) ; interpretedTextAtoms : ~BackTick+ ; inlineLiteral : BackTick BackTick inlineLiteralAtoms BackTick BackTick -> ^(INLINE_LITERAL inlineLiteralAtoms) ; inlineLiteralAtoms : inlineLiteralAtom+ ; inlineLiteralAtom : ~BackTick | BackTick ~BackTick ; reference : Any+ UnderScore -> ^(REFERENCE Any+) ; UnderScore : '_' ; BackTick : '`' ; Star : '*' ; Space : ' ' | '\t' ; EscapeSequence : '\\' ('\\' | '*') ; LineBreak : '\r'? '\n' | '\r' ; Any : . ; 

When you create a parser and lexer from the above, and let it parse the following input file:

  *** x *** ** yyy ** * zz * *
 abc

 P2 `` * a * `b``, q`
 Python_

(pay attention to the trailing line break!)

The analyzer will produce the following AST:

enter image description here

EDIT

You can create a graph by running this class:

 import org.antlr.runtime.*; import org.antlr.runtime.tree.*; import org.antlr.stringtemplate.*; public class Main { public static void main(String[] args) throws Exception { String source = "***x*** **yyy** *zz* *\n" + "abc\n" + "\n" + "P2 ``*a*`b`` `q`\n" + "Python_\n"; RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source)); RSTParser parser = new RSTParser(new CommonTokenStream(lexer)); CommonTree tree = (CommonTree)parser.parse().getTree(); DOTTreeGenerator gen = new DOTTreeGenerator(); StringTemplate st = gen.toDOT(tree); System.out.println(st); } } 

or if your source comes from a file, do:

 RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst")); 

or

 RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst", "???")); 

where is "???" is the encoding of your file.

The class above will print AST as a DOT file on the console. You can use the DOT viewer to display AST. In this case, I posted the image created by kgraphviewer . But there are many more viewers around . A good online one is that which seems to use the kgraphviewer under the “hood”. Good luck

+6
source

Robin wrote:

I thought writing rules for inline markup text would be easy

I have to admit that I am not familiar with this markup language, but it seems to resemble BB-Code or Wiki markup, which are not easy to translate into ANTLR grammar! These languages ​​do not make it easy to recognize tokens, because they depend on where these tokens are. Sometimes white spaces have special meaning (with lists of definitions). So no, it's not at all easy, IMO. Therefore, if this is just an exercise to familiarize yourself with ANTLRs (or the parser generators in general), I highly recommend choosing something else for parsing.

Robin wrote:

Can someone point out my mistakes and maybe give me a hint on how to fit the regular text?

You must first understand that ANTLR creates a lexer (tokenizer) and a parser. Lexer rules begin with an uppercase letter, and parser rules start with a lowercase. The parser can only work with tokens (objects that are executed by lexer rules). To maintain order, you should not use literal tokens inside parser rules (see Rule q in the grammar below). Also, ~ (negation) meta char has a different meaning depending on where it is used (in a parser or lexer rule).

Take the following grammar:

 p : T; q : ~'z'; T : ~'x'; U : 'y'; 

ANTLR will first "move" the literal 'z' to the lexer rule, for example:

 p : T; q : ~RANDOM_NAME; T : ~'x'; U : 'y'; RANDOM_NAME : 'z'; 

(the name RANDOM_NAME not used, but it does not matter). Now the q parsing rule does not match any character other than 'z' ! Negation inside the analyzer rule cancels the token (or lexer rule). Thus, ~RANDOM_NAME will match either the lexer T rule or the lexer U rule.

Inside lexer rules, ~ cancels the (single!) Character. Thus, the lexer T rule will match any character in the range \u0000 .. \uFFFF except for 'x' . Note that the following rule: ~'ab' not valid inside the lexer rule: you can only undo single character sets.

So all these ~'???' inside the rules of your parser are wrong (wrong, as in: they don’t behave the way you expect from them).

Robin wrote:

Is there a way to set priority in grammar rules? Perhaps this may be an occasion.

Yes, the order is from top to bottom in both lexer and parser rules (where top has the highest priority). Let's say parse is the entry point of your grammar:

 parse : p | q ; 

then p will be checked first, and if that fails, q tries to match.

Regarding lexer rules, rules that are keywords, for example, are matched before a rule that can match the specified keywords:

 // first keywords: WHILE : 'while'; IF : 'if' ELSE : 'else'; // and only then, the identifier rule: ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*; 
+4
source

All Articles