I have never done pattern matching with supplementary characters, but I think it is as simple as encoding them (in patterns and strings) as two 16-bit numbers (a UTF-16 surrogate pair): \unnnn\ummmm. java.util.regex should be smart enough to interpret these two numbers (Java chars) as a single character in patterns and strings (although Java will still see them as two chars, as elements of the string).
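As a minimal sketch of the surrogate-pair encoding described above, the standard `Character` methods show how one supplementary code point maps to two Java chars. The class name `SurrogateDemo`, the helper `utf16Units`, and the choice of U+20000 as the sample code point are only illustrative assumptions:

```java
public class SurrogateDemo {
    // Returns the UTF-16 code units of a code point as hex, separated by spaces.
    public static String utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char c : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x20000; // sample supplementary code point (assumption for illustration)
        System.out.println(utf16Units(cp));                        // the two surrogates
        System.out.println(Character.isSupplementaryCodePoint(cp)); // outside the BMP?
        System.out.println(Character.charCount(cp));               // Java chars needed
    }
}
```

For U+20000 the two units are D840 and DC00, which is the pair used in the example further down.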
Two links:
Java Unicode Encoding
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
From the last link (referring to Java 5):
The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units.
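Note also that if you use UTF8 as the encoding of your source files, you can write supplementary characters directly in the source (see the section on supplementary characters in source files in the last link).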
For example:
// Supplementary code point U+20000, written as a surrogate pair, followed by {2}
String pat1 = ".*\uD840\uDC00{2}.*";
String s1 = "HI \uD840\uDC00\uD840\uDC00 BYE";
System.out.println(s1.matches(pat1) + " len=" + s1.length());
// Two distinct BMP characters ('@' and 'A'), followed by {2}
String pat2 = ".*\u0040\u0041{2}.*";
String s2 = "HI \u0040\u0041\u0040\u0041 BYE";
System.out.println(s2.matches(pat2) + " len=" + s2.length());
Compiled with Java 6, this prints:
true len=11
false len=11
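This matches the behavior described above. In the first case we have a single code point, represented in Java as a surrogate pair (two 16-bit chars, one Unicode character), and the {2} quantifier applies to the whole pair (= one code point), so the string, which contains it twice, matches. In the second case we have two distinct BMP characters, the quantifier applies only to the last one, and so there is no match.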
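To make the scope of the quantifier concrete, here is a small sketch under the same assumptions as the example (U+20000 as the supplementary character, `@A` as the BMP pair). The class name `QuantifierScopeDemo` and the helper `found` are hypothetical names introduced only for illustration:

```java
import java.util.regex.Pattern;

public class QuantifierScopeDemo {
    // True if the regex matches somewhere inside the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String bmp = "HI @A@A BYE";
        // The quantifier binds to 'A' only, so this looks for "@AA".
        System.out.println(found("@A{2}", bmp));
        // A non-capturing group makes the quantifier repeat the two-char sequence.
        System.out.println(found("(?:@A){2}", bmp));

        String cp = new String(Character.toChars(0x20000));
        String supp = "HI " + cp + cp + " BYE";
        // No group is needed here: the surrogate pair is already one code point.
        System.out.println(found(cp + "{2}", supp));
    }
}
```

The BMP pair needs an explicit group to be repeated as a unit, while the surrogate pair does not, because the regex engine already treats it as one character.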
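Note, however, that the reported length is the same in both cases (Java's length() counts Java chars, not Unicode code points).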
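If the number of Unicode characters is what you need, `String.codePointCount` gives it. The following is only a sketch reusing the two strings from the example; the class name `CodePointLengthDemo` and the helper `codePoints` are hypothetical:

```java
public class CodePointLengthDemo {
    // Number of Unicode code points in s, as opposed to the number of Java chars.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s1 = "HI \uD840\uDC00\uD840\uDC00 BYE"; // two supplementary characters
        String s2 = "HI \u0040\u0041\u0040\u0041 BYE"; // BMP characters only
        System.out.println("s1 chars=" + s1.length() + " codePoints=" + codePoints(s1));
        System.out.println("s2 chars=" + s2.length() + " codePoints=" + codePoints(s2));
    }
}
```

Both strings have 11 Java chars, but the first has only 9 code points, since each surrogate pair counts once.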