Does Java's regex library support Unicode characters outside the BMP (i.e. code points > 0xFFFF)?

I am currently using Java 6 (I have no way to upgrade to Java 7) and I'm trying to use the java.util.regex package to match string patterns containing Unicode characters.

I know that java.lang.String has supported supplementary characters (i.e. characters with code points > 0xFFFF) since Java 5, but I don't see a simple way to do pattern matching with these characters. java.util.regex.Pattern still only allows hexadecimal escapes with 4 digits (e.g. \uFFFF).

Does anyone know if I'm missing an API here?

+5
2 answers

I have never done matching with supplementary characters, but I think it's as simple as encoding them (in both patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair): \unnnn\ummmm. java.util.regex should be smart enough to interpret those two numbers (Java chars) as a single character in patterns and strings (although Java will still see them as two chars, as elements of the string).
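As a rough sketch of the surrogate-pair encoding described above, the following uses Character.toChars (available since Java 5) to split a code point into its UTF-16 units. The class name SurrogatePairs and the helper toEscapes are hypothetical names made up for this illustration, and U+20000 is just a sample supplementary code point.

```java
public class SurrogatePairs {
    // Hypothetical helper: renders a code point as the text of its
    // UTF-16 escapes (one escape for BMP, two for supplementary).
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x20000; // sample supplementary code point (assumption)
        String s = new String(Character.toChars(cp));
        System.out.println(toEscapes(cp));                    // the two escapes for the pair
        System.out.println(s.length());                       // counts UTF-16 chars
        System.out.println(s.codePointCount(0, s.length()));  // counts Unicode characters
    }
}
```

The string has length 2 but contains a single code point, which is the distinction the answer draws between Java chars and Unicode characters.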

Two links:

Java Unicode Encoding

http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

From the last link (referring to Java 5):

The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units.

, UTF8 ( ), (. " " ).

An example:

    // Supplementary character U+20000, written as the surrogate pair D840 DC00.
    String pat1 = ".*\uD840\uDC00{2}.*";
    String s1  = "HI \uD840\uDC00\uD840\uDC00 BYE";
    System.out.println(s1.matches(pat1) + " len=" + s1.length());

    // Same shape with two BMP characters ('@' and 'A') for comparison.
    String pat2 = ".*\u0040\u0041{2}.*";
    String s2 = "HI \u0040\u0041\u0040\u0041 BYE";
    System.out.println(s2.matches(pat2) + " len=" + s2.length());

This prints, with Java 6:

    true len=11
    false len=11

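That is, in the first case the regex treats the two Java chars (two 16-bit values, one Unicode character) as a single unit, so the quantifier {2} applies to the whole pair (= one character). In the second case the two chars are separate BMP characters, so the quantifier applies only to the last one, and the match fails.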
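The same behaviour can be checked in isolation. This is a minimal sketch, assuming the code-point handling shown in the output above; the class and method names are hypothetical, and the last method shows the explicit grouping a BMP sequence needs if the quantifier is meant to cover both characters.

```java
import java.util.regex.Pattern;

public class CodePointMatching {
    // U+20000 written as its surrogate pair, as in the example above.
    static final String SUPP = "\uD840\uDC00";

    // A single dot should consume the whole pair, not one char of it.
    static boolean dotMatchesWholePair() {
        return Pattern.matches(".", SUPP);
    }

    // The quantifier binds to the supplementary character as a unit.
    static boolean quantifierCoversPair() {
        return Pattern.matches(SUPP + "{2}", SUPP + SUPP);
    }

    // Two separate BMP characters: the quantifier binds to 'A' only.
    static boolean ungroupedBmpSequence() {
        return Pattern.matches("@A{2}", "@A@A");
    }

    // Grouping makes the quantifier cover the two-character sequence.
    static boolean groupedBmpSequence() {
        return Pattern.matches("(?:@A){2}", "@A@A");
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesWholePair());
        System.out.println(quantifierCoversPair());
        System.out.println(ungroupedBmpSequence());
        System.out.println(groupedBmpSequence());
    }
}
```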

, ( Java , Java, Unicode).

+6

- UTF-8 . . .

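- Otherwise, work with Java's UTF-16 representation, i.e. spell out the surrogate pairs yourself. Proper Unicode code point escapes arrived in JDK7, written \x{HHHHH}. They also work inside a character class, as \x{H..H}.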
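To make the two routes concrete, here is a hedged sketch. The helpers literal and range are hypothetical names for building patterns from code points using only APIs available on Java 6 (Character.toChars, Pattern.quote); range assumes neither endpoint is a regex metacharacter. The \x{...} form is shown for comparison and needs Java 7 or later, so it would not compile as a pattern on Java 6. The range U+20000 to U+2A6DF is only a sample.

```java
import java.util.regex.Pattern;

public class CodePointPatterns {
    // Java 6 friendly: a quoted pattern fragment for one code point.
    static String literal(int codePoint) {
        return Pattern.quote(new String(Character.toChars(codePoint)));
    }

    // Java 6 friendly: a character class for an inclusive code point range.
    // Assumes the endpoints are not regex metacharacters such as ']' or '\'.
    static String range(int from, int to) {
        return "[" + new String(Character.toChars(from))
                + "-" + new String(Character.toChars(to)) + "]";
    }

    // Java 7+ only: the same range written with code point escapes.
    static boolean matchesWithJdk7Escapes(String input) {
        return Pattern.matches("[\\x{20000}-\\x{2A6DF}]", input);
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20001));
        System.out.println(Pattern.matches(range(0x20000, 0x2A6DF), s));
        System.out.println(Pattern.matches(literal(0x20001), s));
        System.out.println(matchesWithJdk7Escapes(s));
    }
}
```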

, , . UTF-16 . , UTF-8 UTF-32, . , .

+2
