Negative regex weirdness in Java regex

I am trying to understand the behavior of regular expressions in Java and have run into something that seems very strange. In the code below, the test ended unexpectedly for reasons that I don’t understand when testing, with the message label "6 letters, consistent, negative" (the next two tests also do not work). Did I look at this for too long, or did something really strange happen? This is not the case with a negative forecast with a variable length (?! X), but I would be glad to hear any theories or even confirmation that others are experiencing the same problem and that this is not specific to my JVM. Sorry that regex is so far-fetched, but you don't want to see the real thing :)

// $ java -version // java version "1.7.0_10" // Java(TM) SE Runtime Environment (build 1.7.0_10-b18) // Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode) // test of word without agreement String test = "plusieurs personne sont"; // match the pattern with curly braces assertTrue("no letters matched", Pattern.compile("plusieurs personne\\b").matcher(test).find()); assertTrue("1 letters matched", Pattern.compile("plusieurs personn\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("2 letters matched", Pattern.compile("plusieurs person\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("3 letters matched", Pattern.compile("plusieurs perso\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("4 letters matched", Pattern.compile("plusieurs pers\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("5 letters matched", Pattern.compile("plusieurs per\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("6 letters matched", Pattern.compile("plusieurs pe\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("7 letters matched", Pattern.compile("plusieurs p\\p{Alpha}{1,100}\\b").matcher(test).find()); assertTrue("8 letters matched", Pattern.compile("plusieurs \\p{Alpha}{1,100}\\b").matcher(test).find()); // match the negative pattern (without s or x) with curly braces assertTrue("no letters matched, negative", Pattern.compile("plusieurs (?!personne[sx])\\w+").matcher(test).find()); assertTrue("1 letters matched, negative", Pattern.compile("plusieurs (?!personn\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); assertTrue("2 letters matched, negative", Pattern.compile("plusieurs (?!person\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); assertTrue("3 letters matched, negative", Pattern.compile("plusieurs (?!perso\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); assertTrue("4 letters matched, negative", Pattern.compile("plusieurs (?!pers\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); assertTrue("5 letters matched, negative", Pattern.compile("plusieurs (?!per\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); // the assertion below fails (is false) for reasons unknown assertTrue("6 letters matched, negative", Pattern.compile("plusieurs (?!pe\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); assertTrue("7 letters matched, negative", Pattern.compile("plusieurs (?!p\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); assertTrue("8 letters matched, negative", Pattern.compile("plusieurs (?!\\p{Alpha}{1,100}[sx])\\w+").matcher(test).find()); 
+4
source share
1 answer

See what the view looks like:

 pe literal, matches "pe" r matches \p{Alpha}{1,100} s matches [sx] 

So a negative lookahead doesn't match (the tail of your line, "onne sont" , doesn't matter here).

Placing \\b after [sx] may help if your idea is that the next word should not end with s or x. Always keep in mind that a negative look will not regret failure, and it will not return to find how it is possible that your regular expression does not match.

UPD: let's look closer to case 5 to compare it with case 6. Here we are working with a hypothetical match (for expression inside lookahead), so we should consider several options for how this could happen (almost).

 per literal, would match "per" -- it always so -- let try to imagine how the rest could match: sonn would match \p{Alpha}{1,100} e wouldn't match [sx], FAIL -- or maybe s would match \p{Alpha}{1,100} o wouldn't match [sx], FAIL -- or maybe yet so would match \p{Alpha}{1,100} n wouldn't match [sx], FAIL. 

We would have another interesting adventure if the second word were "personali s ation".

UPD2: the discussion in the comment prompted me to add a generalization here:

Regular expressions are attractive because they share an important feature of the human mind: rejection of confirmation . When we write regular expressions, we want them to match each other; even if our task is to prevent invalid data entry, we often think of valid inputs. The regex regulator usually shares this property: it wants a match and hates a failure. Therefore, a subexpression like \p{Alpha}{1,100} does not mean “there is the longest piece of Alphas” before trying to match the rest of the input. ”This means, roughly speaking,“ consider all possible Alpha fragments of length within [1,100] , find a way to match the entire expression. "

This is why regular expressions make it easy to miss the false positives of complex expressions: incorrect inputs that are mistaken. This problem does not appear, it becomes more noticeable when we use negative images:

Inside a negative lookahead, the regular expression matches the internal expression (which leads to a violation of the expression). The human programmer still wants to match the external expression; as we see in our example, this factor affects our discussion of internal expression. We think that he should not try so hard to match (and, for example, that he should process subexpressions stupidly, instantly using the longest input). The connector works as usual, but our ideas about the desired behavior are now out of sync with its algorithm. And false positives for internal expression (which are difficult to notice) become false negatives for external expression (which we notice and hate).

+3
source

All Articles