Why does a Java regex for surrogates also match the hyphen-minus (-)?

I am trying to find out why this regular expression in Java, ([\ud800-\udbff\udc00-\udfff]), used in replaceAll(regexp, ""), also removes the hyphen-minus character along with the surrogate characters.

The Unicode code point for the hyphen-minus is \u002d, so it is not within any of these ranges.

I could easily remove this behavior by adding &&[^\u002d] , resulting in ([\ud800-\udbff\udc00-\udfff&&[^\u002d]])

But since I don't know why \u002d is removed, I suspect that other, less noticeable characters may be removed as well.

Example:

 String text = "A\u002dB";
 System.out.println(text);
 String regex = "([\ud800-\udbff\udc00-\udfff])";
 System.out.println(text.replaceAll(regex, "X"));

Prints:

 A-B
 AXB
+9
java regex
Jan 07 '15 at 13:50
2 answers

Overview and Assumption

Matching characters in the astral planes (code points U+10000 to U+10FFFF) is an under-documented feature of Java regex.

This answer mainly deals with Oracle's implementation (the reference implementation, which is also used in OpenJDK) for Java version 6 and above.

Please check the code yourself if you are using GNU Classpath or Android, as they use their own implementation.

Behind the scenes

Assuming you are running your regex on Oracle's implementation, your regex

 "([\ud800-\udbff\udc00-\udfff])" 

is compiled as follows:

 StartS. Start unanchored match (minLength=1)
 java.util.regex.Pattern$GroupHead
 Pattern.union. A ∪ B:
   Pattern.union. A ∪ B:
     Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
     BitClass. Match any of these 1 character(s):
       [U+002D]
   SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
 java.util.regex.Pattern$GroupTail
 java.util.regex.Pattern$LastNode
 Node. Accept match

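The character class is parsed as the range \ud800-\udbff\udc00, a literal -, and the single character \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00, so the hyphen that was meant to start a second range ends up as an ordinary member of the class.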
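The parse described above can be checked directly. The following is a minimal sketch (the class and method names are illustrative and not part of the original post) that computes the code point formed by \udbff\udc00 and confirms that the question's regex accepts a lone hyphen.

```java
import java.util.regex.Pattern;

class SurrogateParseDemo {
    // The regex from the question, written with Java string-literal escapes.
    static final String QUESTION_REGEX = "([\ud800-\udbff\udc00-\udfff])";

    // Code point formed by the adjacent surrogates \udbff\udc00.
    static int pairedCodePoint() {
        return Character.toCodePoint('\udbff', '\udc00');
    }

    // True if the whole input is matched by the question's regex.
    static boolean matchesWhole(String input) {
        return Pattern.matches(QUESTION_REGEX, input);
    }

    public static void main(String[] args) {
        System.out.printf("U+%X%n", pairedCodePoint());            // U+10FC00
        System.out.println(matchesWhole("-"));                     // true
        System.out.println("A-B".replaceAll(QUESTION_REGEX, "X")); // AXB
    }
}
```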

Wrong solution

It makes no sense to write:

 "[\ud800-\udbff][\udc00-\udfff]" 

Since Oracle's implementation matches by code point, and valid surrogate pairs are converted to a single code point before matching, the regular expression above cannot match anything: it looks for 2 consecutive lone surrogates that together would form a valid pair.
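A small check of this claim, as a sketch only: the class name and the sample code point U+1F600 are arbitrary choices for illustration. On the Oracle/OpenJDK implementation the two-class pattern is expected to find nothing in a string that holds one valid surrogate pair; other implementations may behave differently.

```java
import java.util.regex.Pattern;

class TwoClassDemo {
    // Two separate classes: one for high surrogates, one for low surrogates.
    static final Pattern TWO_CLASSES =
            Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    // True if the pattern finds a match anywhere in the input.
    static boolean findsIn(String input) {
        return TWO_CLASSES.matcher(input).find();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as one valid surrogate pair (2 chars).
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(astral.length());  // 2
        System.out.println(findsIn(astral));  // false on Oracle/OpenJDK
    }
}
```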

Solution

If you want to match and remove all code points above U+FFFF in the astral planes (each formed by a valid surrogate pair), plus lone surrogates (which cannot form a valid surrogate pair), you should write:

 input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", ""); 

This solution has been tested to work in Java 6 and 7 (Oracle implementation).

The regular expression above compiles to:

 StartS. Start unanchored match (minLength=1)
 Pattern.union. A ∪ B:
   Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
   Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
 java.util.regex.Pattern$LastNode
 Node. Accept match
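As a quick illustration of the two ranges, here is a minimal sketch that wraps the replaceAll call in a helper. The class name, method name, and sample string are assumptions made for the example.

```java
class AstralStripper {
    // Removes astral code points (valid surrogate pairs) and lone surrogates.
    static String strip(String input) {
        return input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
    }

    public static void main(String[] args) {
        // "A-B", then U+1F600 (a valid pair), "C", a lone high surrogate, "D".
        String sample = "A-B" + new String(Character.toChars(0x1F600))
                + "C" + '\ud800' + "D";
        System.out.println(strip(sample)); // A-BCD
    }
}
```

The hyphen survives because every - inside the class now sits between two complete code points and is therefore read as a range operator.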

Note that I am specifying the characters with Unicode escape sequences in the Java string literal, not with escape sequences in the regex syntax. The regex-syntax equivalent would be:

 // Only works in Java 7
 input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");

Java 6 does not recognize surrogate pairs when they are specified in regex syntax, so it reads \\ud800 as a single character and then tries to compile the range \\udc00-\\udbff, which fails. We are fortunate that it throws an exception for this input; otherwise the error would go undetected. Java 7 parses this regular expression correctly and compiles it into the same structure as above.




Since Java 7, the syntax \x{h..h} has been available for specifying characters outside the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in the astral planes.

 input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", ""); 

This regular expression is also compiled into the same structure as above.
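The two regex-syntax spellings can be compared side by side. This is a sketch that assumes Java 7 or later on the Oracle/OpenJDK implementation; the class, constant, and method names are illustrative.

```java
class AstralStripperEscapes {
    // \x{h..h} form for the astral range, string-literal escapes for the surrogate range.
    static final String HEX_BRACE = "[\\x{10000}-\\x{10ffff}\ud800-\udfff]";

    // Regex-syntax \\u escapes; consecutive high and low escapes are read as one pair.
    static final String ESCAPED_PAIRS =
            "[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]";

    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // "A-B", then U+1F600 (a valid pair), "C", a lone high surrogate, "D".
        String sample = "A-B" + new String(Character.toChars(0x1F600))
                + "C" + '\ud800' + "D";
        System.out.println(strip(HEX_BRACE, sample));     // A-BCD
        System.out.println(strip(ESCAPED_PAIRS, sample)); // A-BCD
    }
}
```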

+7
Jan 07 '15 at 15:59

If you make a range

 [\ud800-\udfff] 

or

 [\ud800-\udbff\udbff-\udfff] 

it will leave the hyphen intact. This seems to be a bug.

Please note: there is no reason for the double range. In your example, \udc00 is just the next code point after \udbff, so you can skip it. If you make the two ranges overlap by one or more code points, it works again, but you can simply leave the second range out (see my first example above).
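A minimal sketch of the single-range form, with illustrative class and method names. It only shows that the hyphen is left alone while a lone surrogate is still matched; it does not exercise characters stored as valid surrogate pairs.

```java
class SimpleRangeDemo {
    // Replaces every char in the single surrogate range with "X".
    static String mark(String input) {
        return input.replaceAll("[\ud800-\udfff]", "X");
    }

    public static void main(String[] args) {
        System.out.println(mark("A-B"));                // A-B
        System.out.println(mark("A" + '\ud800' + "B")); // AXB
    }
}
```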

+1
Jan 07 '15 at 14:11


