How do I perform input tokenization using the Java Scanner class and regular expressions?

Just for my own purposes, I'm trying to build a tokenizer in Java where I can define a grammar and have it tokenize input based on that. The StringTokenizer class is deprecated, and I've found a couple of functions in Scanner that hint at what I want to do, but no luck so far. Does anyone know a good way to go about this?

+5
4 answers

The name "Scanner" is a bit misleading, because the word is often used to mean a lexical analyzer, and that's not what Scanner is. All it is is a replacement for the scanf() function you find in C, Perl, et al. Like StringTokenizer and split(), it's designed to scan ahead until it finds a match for a given pattern, and whatever it skipped over on the way is returned as the token.
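To make the scanf() analogy concrete, here is a minimal sketch; the input string and the key/value layout are invented for illustration:

```java
import java.util.Scanner;

public class ScanfStyle {
    public static void main(String[] args) {
        // Like scanf(), Scanner skips ahead over the delimiter
        // (whitespace by default) and converts whatever comes next;
        // it never classifies the characters it skips.
        Scanner in = new Scanner("width 640 height 480");
        while (in.hasNext()) {
            String key = in.next();   // "width", then "height"
            int value = in.nextInt(); // 640, then 480
            System.out.println(key + " = " + value);
        }
    }
}
```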

A lexical analyzer, on the other hand, has to examine and categorize every character, even if only to decide whether it can safely ignore it. That means that, after each match, it may apply several patterns until it finds one that matches starting at that point. Otherwise, it might find the sequence "//" and think it has found the beginning of a comment, when it is really inside a string literal and has simply failed to notice the opening quotation mark.
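A short sketch of that failure mode; the input line is made up for illustration:

```java
import java.util.regex.*;

public class SplitPitfall {
    public static void main(String[] args) {
        String line = "bar \"no //comment\" end";

        // A scan-ahead approach treats everything after the first "//"
        // as a comment, even though this "//" sits inside a quoted string.
        String naive = line.split("//", 2)[0];
        System.out.println(naive);  // prints: bar "no

        // A lexical analyzer walks left to right, so it sees the opening
        // quote first and consumes the whole literal as one token before
        // it ever considers "//".
        Matcher quoted = Pattern.compile("\"[^\"]*\"").matcher(line);
        if (quoted.find()) {
            System.out.println(quoted.group());  // prints: "no //comment"
        }
    }
}
```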

It's actually much more complicated than that, of course, but I'm just demonstrating why the built-in tools like StringTokenizer, split() and Scanner aren't suitable for this kind of task. It is, however, possible to use Java's regex classes for a limited form of lexical analysis. In fact, the addition of the Scanner class made it much easier, thanks to the new Matcher API that was added to support it: regions and the usePattern() method. Here's an example of a rudimentary scanner built on top of Java's regex classes.

import java.util.*;
import java.util.regex.*;

public class RETokenizer
{
  static List<Token> tokenize(String source, List<Rule> rules)
  {
    List<Token> tokens = new ArrayList<Token>();
    int pos = 0;
    final int end = source.length();
    Matcher m = Pattern.compile("dummy").matcher(source);
    m.useTransparentBounds(true).useAnchoringBounds(false);
    while (pos < end)
    {
      m.region(pos, end);
      boolean matched = false;
      for (Rule r : rules)
      {
        if (m.usePattern(r.pattern).lookingAt())
        {
          tokens.add(new Token(r.name, m.start(), m.end()));
          pos = m.end();
          matched = true;
          break;
        }
      }
      if (!matched)
      {
        pos++;  // bump along, in case no rule matched at this position
      }
    }
    return tokens;
  }

  static class Rule
  {
    final String name;
    final Pattern pattern;

    Rule(String name, String regex)
    {
      this.name = name;
      pattern = Pattern.compile(regex);
    }
  }

  static class Token
  {
    final String name;
    final int startPos;
    final int endPos;

    Token(String name, int startPos, int endPos)
    {
      this.name = name;
      this.startPos = startPos;
      this.endPos = endPos;
    }

    @Override
    public String toString()
    {
      return String.format("Token [%2d, %2d, %s]", startPos, endPos, name);
    }
  }

  public static void main(String[] args) throws Exception
  {
    List<Rule> rules = new ArrayList<Rule>();
    rules.add(new Rule("WORD", "[A-Za-z]+"));
    rules.add(new Rule("QUOTED", "\"[^\"]*+\""));
    rules.add(new Rule("COMMENT", "//.*"));
    rules.add(new Rule("WHITESPACE", "\\s+"));

    String str = "foo //in \"comment\"\nbar \"no //comment\" end";
    List<Token> result = RETokenizer.tokenize(str, rules);
    for (Token t : result)
    {
      System.out.println(t);
    }
  }
}

This, by the way, is the only use I've ever found for lookingAt(). :D
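For the curious, a quick sketch of how lookingAt() differs from its siblings (the input string is invented for illustration):

```java
import java.util.regex.*;

public class LookingAtDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("[a-z]+").matcher("foo bar");

        // matches() insists the pattern cover the entire input.
        System.out.println(m.matches());    // false: the space breaks it

        // lookingAt() anchors only at the start of the region -- exactly
        // what a tokenizer needs: each token must begin at the current
        // position but need not run to the end of the input.
        System.out.println(m.lookingAt());  // true
        System.out.println(m.group());      // foo

        // find() silently skips unmatched characters, which for a
        // tokenizer would mean swallowing lexical errors.
        System.out.println(m.find());       // true: skipped ahead to "bar"
    }
}
```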

+11

It depends on what kind of "tokens" you need. If you want something simple rather than sophisticated, a plain regex-based approach is enough. In that case you can give Scanner a custom delimiter, or use String.split(), which also takes a regular expression.

Here is an example of both approaches:

import java.util.Scanner;

public class Main {

    public static void main(String[] args) {

        String textToTokenize = "This is a text that will be tokenized. I will use 1-2 methods.";

        // The delimiter is itself a regex: 'i' followed by any character.
        Scanner scanner = new Scanner(textToTokenize);
        scanner.useDelimiter("i.");
        while (scanner.hasNext()) {
            System.out.println(scanner.next());
        }

        System.out.println(" **************** ");

        // String.split() accepts the same kind of regex delimiter.
        String[] sSplit = textToTokenize.split("i.");
        for (String token : sSplit) {
            System.out.println(token);
        }
    }
}
+3

If this is for a real project (and not just an exercise), I wouldn't write the lexer by hand.

Instead, take a look at a lexer generator such as JFlex. It generates a lexical analyzer from a set of regular-expression rules, which is the kind of grammar-driven tokenization you describe.

+2

Most of the answers here are already excellent, but I would be remiss if I didn't point out ANTLR. I have built entire compilers around this excellent tool. Version 3 has some amazing features, and I would recommend it for any project that requires you to parse input based on a well-defined grammar.

+2