ANTLR: Unicode Character Scan

Question

Problem: I cannot get a Unicode character to print correctly.

Here is my grammar:

    options {
        k=1;
        filter=true;
        // Allow any char but \uFFFF (16 bit -1)
        charVocabulary='\u0000'..'\uFFFE';
    }

    ANYCHAR
        : '$'
        | '_' { System.out.println("Found underscore: "+getText()); }
        | 'a'..'z' { System.out.println("Found alpha: "+getText()); }
        | '\u0080'..'\ufffe' { System.out.println("Found unicode: "+getText()); }
        ;

Code snippet of the main method that calls the lexer:

    public static void main(String[] args) {
        SimpleLexer simpleLexer = new SimpleLexer(System.in);
        while(true) {
            try {
                Token t = simpleLexer.nextToken();
                System.out.println("Token : "+t);
            } catch(Exception e) {}
        }
    }

When I enter "ठ", I get the following output:

    Found unicode:
    Token : ["à",<5>,line=1,col=7]
    Found unicode:
    Token : ["¤",<5>,line=1,col=8]
    Found unicode:
    Token : [" ",<5>,line=1,col=9]

It looks like the lexer treats the Unicode char "ठ" as three separate characters. My goal is to scan and print "ठ".

java antlr lexer

Jhakki Sep 2 '10 at 21:57

1 answer

jpalecek · Accepted Answer · 2010-09-02T22:20:54+0000

Your problem is not in the vocabulary generated by ANTLR, but in the Java stream that you pass to it. An InputStream delivers raw bytes and does not decode them with any character encoding, so what you see are the three bytes of the UTF-8 sequence for "ठ" (E0 A4 A0), each treated as a separate character.
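To make the difference between bytes and decoded characters concrete, here is a minimal sketch that uses only the Java standard library and no ANTLR classes. The class name Utf8DecodeDemo and its two helper methods are hypothetical names made up for this illustration. It reads the UTF-8 encoding of "ठ" once as raw bytes and once through an InputStreamReader that is told to decode UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of ANTLR or of the question's code.
public class Utf8DecodeDemo {

    // Reads the stream one byte at a time, which is what a consumer
    // of a raw InputStream gets when no decoding step is involved.
    static List<Integer> readRawBytes(InputStream in) throws IOException {
        List<Integer> units = new ArrayList<>();
        int b;
        while ((b = in.read()) != -1) {
            units.add(b);
        }
        return units;
    }

    // Wraps the stream in a Reader that decodes UTF-8, so each read()
    // returns one UTF-16 code unit instead of one byte.
    static List<Integer> readDecodedChars(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        List<Integer> units = new ArrayList<>();
        int c;
        while ((c = reader.read()) != -1) {
            units.add(c);
        }
        return units;
    }

    public static void main(String[] args) throws IOException {
        // "\u0920" is the character from the question, written as an
        // escape so the source file's own encoding does not matter.
        byte[] utf8 = "\u0920".getBytes(StandardCharsets.UTF_8);

        // prints [224, 164, 160], i.e. the bytes E0 A4 A0
        System.out.println(readRawBytes(new ByteArrayInputStream(utf8)));

        // prints [2336], i.e. the single character U+0920
        System.out.println(readDecodedChars(new ByteArrayInputStream(utf8)));
    }
}
```

The three raw values correspond to the characters "à", "¤" and a no-break space in Latin-1, which matches the three tokens in the question's output. Whatever lexer is used, the point is to give it decoded characters rather than raw bytes from System.in.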

If it's ANTLR 3, you can use the ANTLRInputStream constructor that takes the encoding as a parameter:

 ANTLRInputStream (InputStream input, String encoding) throws IOException
