You cannot use \s in Java to match white space in its own native character set, because Java does not support the Unicode White_Space property, even though doing so is strictly required to meet RL1.2 of UTS #18! What it does have is, unfortunately, not standards-conforming.
Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ (GeneralCategory=Separator), and the remaining 6 are \p{Cc} (GeneralCategory=Control).
White space is a fairly stable property, and those same code points have been around virtually forever. Even so, Java has no property that conforms to the Unicode Standard for them, so you instead have to use code like this:
    String whitespace_chars = ""  /* dummy empty string for homogeneity */
        + "\\u0009" // CHARACTER TABULATION
        + "\\u000A" // LINE FEED (LF)
        + "\\u000B" // LINE TABULATION
        + "\\u000C" // FORM FEED (FF)
        + "\\u000D" // CARRIAGE RETURN (CR)
        + "\\u0020" // SPACE
        + "\\u0085" // NEXT LINE (NEL)
        + "\\u00A0" // NO-BREAK SPACE
        + "\\u1680" // OGHAM SPACE MARK
        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
        + "\\u2000" // EN QUAD
        + "\\u2001" // EM QUAD
        + "\\u2002" // EN SPACE
        + "\\u2003" // EM SPACE
        + "\\u2004" // THREE-PER-EM SPACE
        + "\\u2005" // FOUR-PER-EM SPACE
        + "\\u2006" // SIX-PER-EM SPACE
        + "\\u2007" // FIGURE SPACE
        + "\\u2008" // PUNCTUATION SPACE
        + "\\u2009" // THIN SPACE
        + "\\u200A" // HAIR SPACE
        + "\\u2028" // LINE SEPARATOR
        + "\\u2029" // PARAGRAPH SEPARATOR
        + "\\u202F" // NARROW NO-BREAK SPACE
        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
        + "\\u3000" // IDEOGRAPHIC SPACE
        ;
    /* A \s that actually works for Java's native character set: Unicode */
    String whitespace_charclass = "[" + whitespace_chars + "]";
    /* A \S that actually works for Java's native character set: Unicode */
    String not_whitespace_charclass = "[^" + whitespace_chars + "]";
Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
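To make that replaceAll step concrete, here is a minimal sketch. It assumes you want to collapse each run of Unicode white space into a single ASCII space. The class name UnicodeWhitespace, the constant WHITESPACE_RUN, and the method collapse are names I made up for illustration, and the pattern is precompiled instead of calling String.replaceAll on every use.

```java
import java.util.regex.Pattern;

final class UnicodeWhitespace {
    // The same 26 code points as in the character list, written as
    // regex-level escapes (the doubled backslash keeps javac from
    // translating them before the regex engine sees them).
    static final String WHITESPACE_CHARS =
          "\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0"
        + "\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005"
        + "\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F"
        + "\\u205F\\u3000";

    // One or more of those code points in a row.
    static final Pattern WHITESPACE_RUN =
        Pattern.compile("[" + WHITESPACE_CHARS + "]+");

    // Hypothetical helper: replace every white space run with one space.
    static String collapse(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, EM SPACE and IDEOGRAPHIC SPACE between the letters.
        String input = "a\u00A0\u2003b\u3000c";
        System.out.println(collapse(input));            // prints "a b c"
        // The default \s leaves the same input untouched.
        System.out.println(input.replaceAll("\\s+", " ").equals(input)); // prints "true"
    }
}
```

The second println is only there to show the difference from the built-in \s when no special flags are passed to the regex engine.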
Sorry about all that. Java's regular expressions just don't work very well on its own native character set, so you really have to jump through exotic hoops to make them work.
And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!
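For a small taste of the \w problem, the sketch below compares the default \w with a hand-built class made from Unicode general categories. The class [\p{L}\p{M}\p{Nd}\p{Pc}] is only a rough approximation I chose for illustration; it is not the full UTS #18 definition of a word character, and the names WordCharDemo, defaultMatches and unicodeMatches are hypothetical.

```java
import java.util.regex.Pattern;

final class WordCharDemo {
    // Default \w with no flags: ASCII letters, digits and underscore only.
    static final Pattern DEFAULT_WORD = Pattern.compile("\\w+");

    // Rough approximation (an assumption, not the UTS #18 definition):
    // letters, marks, decimal digits and connector punctuation.
    static final Pattern UNICODE_WORD =
        Pattern.compile("[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]+");

    static boolean defaultMatches(String s) {
        return DEFAULT_WORD.matcher(s).matches();
    }

    static boolean unicodeMatches(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "resume" spelled with two LATIN SMALL LETTER E WITH ACUTE.
        String word = "r\u00E9sum\u00E9";
        System.out.println(defaultMatches(word)); // prints "false"
        System.out.println(unicodeMatches(word)); // prints "true"
    }
}
```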
Yes, it is possible, and yes, it is a mind-numbing mess. And that is being charitable. The easiest way to get a standards-conforming regular expression library for Java is to go through JNI to ICU. That is what Google does for Android, because OraSun's does not measure up.
If you don't want to do that, but still want to stick with Java, I have a front-end regex rewriting library I wrote that "fixes" Java's patterns, at least enough to make them conform to the requirements of RL1.2a in UTS #18, Unicode Regular Expressions.
tchrist Jan 19 '11 at 2:16