Understanding useDelimiter in a scanner: why am I getting an empty token?

I am using a Scanner with a custom delimiter, and I came across some strange behavior that I would like to understand.

I am using this program:

    Scanner sc = new Scanner("Aller à : Navigation, rechercher");
    sc.useDelimiter("\\s+|\\s*\\p{Punct}+\\s*");
    String word = "";
    while (sc.hasNext()) {
        word = sc.next();
        System.out.println(word);
    }

Output:

    Aller
    à

    Navigation
    rechercher

First, I don't understand why I get an empty token. The documentation says:

Depending on the type of delimiting pattern, empty tokens may be returned. For example, the pattern "\\s+" will return no empty tokens since it matches multiple instances of the delimiter. The delimiting pattern "\\s" could return empty tokens since it only passes one space at a time.

I am using \\s+, so why is an empty token returned?

There is one more thing I would like to understand about the regex. If I change the delimiter to the same regular expression with the two alternatives swapped:

  sc.useDelimiter("\\s*\\p{Punct}+\\s*|\\s+"); 

The result is correct, and I get:

    Aller
    à
    Navigation
    rechercher

Why does it work when the alternatives are in this order?

EDIT:

In this case:

    Scanner sc = new Scanner("(23 ou 24 minutes pour les épisodes avec introduction) (approx.)1");
    sc.useDelimiter("\\s*\\p{Punct}+\\s*|\\s+"); // second regex

I still get an empty token between "introduction" and "approx". Can this be avoided?

2 answers

I suspect you are getting two separate delimiter matches wherever whitespace is followed by punctuation. Why not just use [\\s\\p{Punct}]+?

The regular expression \\s+|\\p{Punct}+ tries its alternatives from left to right, so it first matches the whitespace and consumes it, and then matches the punctuation as a separate delimiter. The result is two delimiters next to each other with nothing between them, which is the empty token.
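
To make the effect of alternation order visible, here is a minimal sketch that runs the question's sentence through the three delimiter patterns discussed here. The class name DelimiterDemo and the helpers delimiters and tokens are made up for illustration. The accented letter is written as a Unicode escape only to keep the source file ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterDemo {
    // Returns each delimiter match in order, so adjacent matches become visible.
    static List<String> delimiters(String input, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Returns every token the Scanner yields for the given delimiter regex.
    static List<String> tokens(String input, String regex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(regex);
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The sentence from the question, with the accented letter escaped.
        String input = "Aller \u00e0 : Navigation, rechercher";
        String[] patterns = {
            "\\s+|\\s*\\p{Punct}+\\s*",   // whitespace alternative first
            "\\s*\\p{Punct}+\\s*|\\s+",   // punctuation alternative first
            "[\\s\\p{Punct}]+"            // one character class for both
        };
        for (String p : patterns) {
            System.out.println(p);
            System.out.println("  delimiter matches: " + delimiters(input, p).size()
                    + ", tokens: " + tokens(input, p));
        }
    }
}
```

With the first pattern, the run " : " is split into two delimiter matches (" " and ": "), and the Scanner reports the nothing between them as an empty token. With the other two patterns the whole run is consumed as a single delimiter, so no empty token appears for this input.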


I also had to deal with the empty token problem with Scanner. I think the delimiter pattern should be made greedy by wrapping the alternatives in a group and adding + to the group. The pattern I used is:

 "((\\s)+|(\\\\r\\\\n)+|\\p{Punct}+)+". 
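
As a rough check of the grouped pattern, the sketch below applies it to the input from the question's EDIT and compares it with the second regex. The class name GroupedDelimiterDemo and the helper tokens are hypothetical. Note that, as written, the (\\\\r\\\\n)+ branch matches a literal backslash followed by r and a backslash followed by n rather than a line break; \\s already covers carriage returns and newlines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class GroupedDelimiterDemo {
    // Returns every token the Scanner yields for the given delimiter regex.
    static List<String> tokens(String input, String regex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(regex);
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Input from the EDIT, with the accented letter escaped.
        String input =
            "(23 ou 24 minutes pour les \u00e9pisodes avec introduction) (approx.)1";

        // Second regex from the question: ") " and "(" match separately.
        System.out.println(tokens(input, "\\s*\\p{Punct}+\\s*|\\s+"));

        // Grouped pattern: the outer + lets one match span ") (".
        System.out.println(tokens(input, "((\\s)+|(\\\\r\\\\n)+|\\p{Punct}+)+"));
    }
}
```

The outer + lets a single delimiter match alternate between whitespace and punctuation, so the run ") (" is consumed in one go. For this purpose the grouped pattern behaves like the simpler character class [\\s\\p{Punct}]+ suggested in the other answer.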

Source: https://habr.com/ru/post/1414505/

