Does Java support non-BMP Unicode characters (i.e. code points > 0xFFFF) in its regex library?

Question


I am currently using Java 6 (upgrading to Java 7 is not an option) and I'm trying to use the java.util.regex package to do pattern matching on strings that contain Unicode characters.

I know that java.lang.String has supported supplementary characters (i.e. characters with code points > 0xFFFF) since Java 5, but I don't see a simple way to do pattern matching with these characters. java.util.regex.Pattern still only accepts hexadecimal escapes with 4 digits (e.g. \uFFFF).

Does anyone know if I'm missing an API here?

+5

java regex unicode astral-plane

Jin kim Mar 23 '11 at 18:06


2 answers

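- Use UTF-8 as the encoding of your source code, so the characters can be written in the pattern directly.

- Otherwise, because of Java's UTF-16, this is awkward. Unicode support is better in JDK7, where you can use \x{HHHHH}. In a charclass, \x{H..H} works as well.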

Yes, this is a nuisance. UTF-16 is what makes it awkward; UTF-8 or UTF-32 would have been a better fit.
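The \x{HHHHH} notation mentioned in this answer needs Java 7 or later, so it does not help on Java 6. For readers who can use it, here is a minimal sketch of what it looks like. The class name HexCodePointEscape and the helpers hasDoubled and allInRange are hypothetical, U+20000 is just a sample supplementary code point, and the range U+20000..U+2A6DF is an arbitrary choice for illustration.

```java
import java.util.regex.Pattern;

class HexCodePointEscape {
    // Java 7+ only: \x{20000} names the code point U+20000 directly,
    // and the {2} quantifier repeats that whole code point.
    static final Pattern DOUBLED = Pattern.compile(".*\\x{20000}{2}.*");

    // The same escape inside a character class, used here as a range.
    static final Pattern IN_RANGE = Pattern.compile("[\\x{20000}-\\x{2A6DF}]+");

    static boolean hasDoubled(String s) {
        return DOUBLED.matcher(s).matches();
    }

    static boolean allInRange(String s) {
        return IN_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Build the sample character from its code point so the source stays ASCII.
        String ideograph = new String(Character.toChars(0x20000));
        System.out.println(hasDoubled("HI " + ideograph + ideograph + " BYE")); // true
        System.out.println(hasDoubled("HI " + ideograph + " BYE"));             // false
        System.out.println(allInRange(ideograph + ideograph));                  // true
    }
}
```

On Java 6 the Pattern.compile calls above would be rejected, since the braced form of \x is not recognized there.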

+2

tchrist 16 . '11 0:05

leonbloy · Accepted Answer · 2011-03-23T18:41:21+0000

I have not tested this with supplementary characters, but I believe it's as simple as encoding them (in patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair): \unnnn\ummmm. java.util.regex is smart enough to interpret these two numbers (Java chars) as a single character in patterns and strings (although Java will still see them as two chars, as elements of the string).
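The surrogate-pair encoding described above can be checked with a short sketch that runs on Java 5 and later. It assumes U+20000 as the sample code point; the class name SurrogatePairPattern and the helper literal are made up for illustration. Character.toChars computes the pair so it does not have to be worked out by hand, and Pattern.quote guards against code points that happen to be regex metacharacters.

```java
import java.util.regex.Pattern;

class SurrogatePairPattern {
    // Turn a code point into a quoted regex literal. A supplementary code
    // point becomes a surrogate pair, i.e. two Java chars.
    static String literal(int codePoint) {
        return Pattern.quote(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        char[] units = Character.toChars(0x20000);
        // The two UTF-16 code units of U+20000, in hex.
        System.out.println(Integer.toHexString(units[0]) + " "
                + Integer.toHexString(units[1])); // d840 dc00

        // One or more occurrences of U+20000, built without \\u escapes.
        Pattern p = Pattern.compile("(?:" + literal(0x20000) + ")+");
        System.out.println(p.matcher("\uD840\uDC00\uD840\uDC00").matches()); // true
    }
}
```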

Two links:

Java Unicode Encoding

http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

From the last link (referring to Java 5):

The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units.

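Alternatively, if your source files are UTF-8 (as the other answer suggests), you can write the characters directly in patterns and strings.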

A quick test:

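    public class SupplementaryRegexTest {
        public static void main(String[] args) {
            // U+20000 written as a surrogate pair, followed by the {2} quantifier.
            String pat1 = ".*\uD840\uDC00{2}.*";
            String s1  = "HI \uD840\uDC00\uD840\uDC00 BYE";
            System.out.println(s1.matches(pat1) + " len=" + s1.length());

            // Same shape with two BMP characters (@ and A) for comparison.
            String pat2 = ".*\u0040\u0041{2}.*";
            String s2 = "HI \u0040\u0041\u0040\u0041 BYE";
            System.out.println(s2.matches(pat2) + " len=" + s2.length());
        }
    }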

On Java 6, this prints:

true len=11
false len=11

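This suggests that in the first case the regex treats the two Java chars (16 bits each, together one Unicode character) as a single unit for the {2} quantifier (= repeat twice). In the second case, with BMP characters, the quantifier applies only to the last character, so the match fails.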

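As always, be careful with the terminology (a Java char is a 16-bit code unit, which is not always a whole Unicode character).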
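The len=11 in the output above counts UTF-16 code units, not characters. A small sketch makes the difference visible, using the same test string; the class name CodeUnitsVersusCodePoints and its two helpers are hypothetical. Everything used here is available from Java 5.

```java
class CodeUnitsVersusCodePoints {
    // Number of Java chars (UTF-16 code units).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s1 = "HI \uD840\uDC00\uD840\uDC00 BYE";
        System.out.println(codeUnits(s1));  // 11
        System.out.println(codePoints(s1)); // 9

        // The regex engine works on code points: a single dot consumes a
        // whole surrogate pair, so nine dots cover the entire string.
        System.out.println("\uD840\uDC00".matches(".")); // true
        System.out.println(s1.matches(".{9}"));          // true
    }
}
```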
