Counting regex matches with streams

I am trying to count the number of matches of a regular expression pattern with a simple Java 8 lambda/stream-based solution. For example, for this pattern/matcher pair:

    final Pattern pattern = Pattern.compile("\\d+");
    final Matcher matcher = pattern.matcher("1,2,3,4");

There is a splitAsStream method that splits the text around the given pattern instead of matching the pattern. Although it is elegant and immutable, it is not always correct:

    // count is 4, correct
    final long count = pattern.splitAsStream("1,2,3,4").count();
    // count is 0, wrong
    final long count = pattern.splitAsStream("1").count();

I also tried to (ab)use an IntStream. The problem is that I have to guess how many times to call matcher.find(), instead of calling it until it returns false.

    final long count = IntStream
            .iterate(0, i -> matcher.find() ? 1 : 0)
            .limit(100)
            .sum();

I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there an easy way to do this using Java 8 lambdas / streams?

+7
java regex java-8 java-stream
5 answers

To use Pattern::splitAsStream, you must invert your regular expression. This means that instead of \\d+ (which splits on each number), you should use \\D+. This gives you the numbers in your string.

    final Pattern pattern = Pattern.compile("\\D+");
    // count is 4
    long count = pattern.splitAsStream("1,2,3,4").count();
    // count is 1
    count = pattern.splitAsStream("1").count();
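One caveat with the inverted pattern: when the input starts with a separator, splitAsStream emits an empty leading string, which would inflate the count. The following is a minimal sketch of one way to guard against that by filtering out empty pieces. The class name InvertedSplitCount and the helper countDigitRuns are hypothetical names chosen for illustration; they are not part of the answer above.

```java
import java.util.regex.Pattern;

final class InvertedSplitCount {
    // Inverted pattern: matches the separators (runs of non-digits), not the numbers.
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Hypothetical helper: counts the runs of digits in the input.
    static long countDigitRuns(String input) {
        return NON_DIGITS.splitAsStream(input)
                // Drop the empty piece produced when the input starts with a separator.
                .filter(piece -> !piece.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countDigitRuns("1,2,3,4")); // 4
        System.out.println(countDigitRuns(",1,2"));    // 2 (3 without the filter)
        System.out.println(countDigitRuns("foobar"));  // 0
    }
}
```

This only works when the complement of the pattern is easy to write, as it is for \\d+ versus \\D+; for arbitrary patterns the inversion may not be practical.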
+4

The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.

The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.

If you print all the pieces returned for 1,2,3,4, you may be surprised to see that it actually returns the commas, not the numbers.

    System.out.println("[" + pattern.splitAsStream("1,2,3,4")
            .collect(Collectors.joining("!")) + "]");

prints [!,!,!,]. The odd bit is why it gives you 4 and not 3: the input starts with a match, so the stream begins with an empty string ahead of the three commas.

This also explains why "1" gives 0: there are no strings between matches.

Quick demo:

    private void test(Pattern pattern, String s) {
        System.out.println(s + "-[" + pattern.splitAsStream(s)
                .collect(Collectors.joining("!")) + "]");
    }

    public void test() {
        final Pattern pattern = Pattern.compile("\\d+");
        test(pattern, "1,2,3,4");
        test(pattern, "a1b2c3d4e");
        test(pattern, "1");
    }

prints

    1,2,3,4-[!,!,!,]
    a1b2c3d4e-[a!b!c!d!e]
    1-[]
+3

You can extend AbstractSpliterator to solve this problem:

    import java.util.Spliterators.AbstractSpliterator;
    import java.util.function.Consumer;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.StreamSupport;

    // Spliterator that emits one element per successful Matcher.find()
    static class SpliterMatcher extends AbstractSpliterator<Integer> {
        private final Matcher m;

        public SpliterMatcher(Matcher m) {
            // Size is unknown up front, so report Long.MAX_VALUE as the estimate
            super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
            this.m = m;
        }

        @Override
        public boolean tryAdvance(Consumer<? super Integer> action) {
            boolean found = m.find();
            if (found) action.accept(m.groupCount());
            return found;
        }
    }

    final Pattern pattern = Pattern.compile("\\d+");

    Matcher matcher = pattern.matcher("1");
    long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
    System.out.println("Count: " + count); // 1

    matcher = pattern.matcher("1,2,3,4");
    count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
    System.out.println("Count: " + count); // 4

    matcher = pattern.matcher("foobar");
    count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
    System.out.println("Count: " + count); // 0
+3

In short, you have a stream of String and a String pattern: how many of these strings match the pattern?

    final String myString = "1,2,3,4";
    Long count = Arrays.stream(myString.split(","))
            .filter(str -> str.matches("\\d+"))
            .count();

where the first line can be any other way of obtaining a stream, such as List<String>.stream(), ...

Am I wrong?

+1

Java 9

You can use Matcher#results() to get all matches:

Stream<MatchResult> results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
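A minimal sketch of how counting could look with Matcher#results(), assuming Java 9 or later. The class name ResultsCount and the helper countMatches are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

final class ResultsCount {
    // Hypothetical helper: counts the matches of the pattern in the input.
    static long countMatches(Pattern pattern, CharSequence input) {
        // Matcher.results() (Java 9+) yields one MatchResult per match, in input order.
        return pattern.matcher(input).results().count();
    }

    public static void main(String[] args) {
        final Pattern pattern = Pattern.compile("\\d+");
        System.out.println(countMatches(pattern, "1,2,3,4")); // 4
        System.out.println(countMatches(pattern, "1"));       // 1
        System.out.println(countMatches(pattern, "foobar"));  // 0
    }
}
```

Because this counts the matches themselves rather than the pieces between them, it needs no inverted pattern and no special handling for leading or trailing separators.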

Java 8 and below

Another simple solution, based on an inverted pattern:

    String pattern = "\\D+";
    System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1

Here, all non-digit characters are removed from the start and end of the string, and then the string is split on non-digit sequences without reporting any trailing empty elements (since 0 is passed as the limit argument of split).

See this demo :

    String pattern = "\\D+";
    System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
    System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 3
    System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
    System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
    System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 2
0