Regarding this question: removing invalid XML characters from a string in java, @McDowell's answer says that the way to remove invalid XML characters is:
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
and then:
replaceAll(xml10pattern, "");
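For reference, here is how I understand the two fragments fitting together, as a minimal self-contained sketch. The class name XmlFilterSketch and the helper stripInvalid are my own names for illustration, not part of the quoted answer; only the character class itself comes from it.

```java
class XmlFilterSketch {
    // Character class from the quoted answer: matches anything that is
    // NOT an allowed XML 1.0 character.
    static final String XML10_PATTERN = "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]";

    // Hypothetical helper: returns a copy of input without the invalid characters.
    static String stripInvalid(String input) {
        return input.replaceAll(XML10_PATTERN, "");
    }

    public static void main(String[] args) {
        // NUL, backspace and U+FFFE are outside the allowed ranges.
        String dirty = "ok\u0000\u0008 text\uFFFE";
        System.out.println(stripInvalid(dirty)); // prints "ok text"
    }
}
```

This uses only String.replaceAll, so it should not need anything newer than JDK 1.5.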
Well, I have two questions:
- Shouldn't all Unicode characters be escaped? I mean \\u0009\\u000A\\u000D... instead of \u0009\r\n, as I saw in @ogrisel's answer: Removing unaccepted XML characters in Java
- I do not understand how the last range (U+10000–U+10FFFF) is converted to "\ud800\udc00-\udbff\udfff". Could it not be "\u10000-\u10FFFF"?
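Both questions can be probed with a small experiment. The sketch below is only that: the class name EscapeSketch and the helper utf16Hex are hypothetical names of mine. It relies on Character.toChars, which exists since Java 5, and on the fact that java.util.regex accepts a backslash-u escape followed by exactly four hex digits.

```java
import java.util.regex.Pattern;

class EscapeSketch {
    // Hypothetical helper: the UTF-16 code units of a code point, as hex.
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        char[] units = Character.toChars(codePoint); // available since Java 5
        for (int i = 0; i < units.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(units[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Question 1: with a single backslash the compiler puts a real tab into
        // the pattern; with a double backslash the regex engine decodes the
        // escape itself. Either way the pattern matches a tab.
        System.out.println(Pattern.matches("\u0009", "\t"));  // true
        System.out.println(Pattern.matches("\\u0009", "\t")); // true

        // Question 2: the ends of the supplementary range as surrogate pairs.
        System.out.println(utf16Hex(0x10000));  // d800 dc00
        System.out.println(utf16Hex(0x10FFFF)); // dbff dfff

        // A Java unicode escape takes exactly four hex digits, so the literal
        // below is U+1000 followed by the character '0'.
        System.out.println("\u10000".length()); // 2
    }
}
```

If this is right, the single-backslash and double-backslash forms are equivalent here, and a five-digit escape cannot express U+10000 because the fifth digit is read as a separate character.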
I really need to detect or filter such characters, and I'm not quite sure how to do this.
By the way, this should work on JDK 1.5 (so expressions of the form \x{h...h} are not allowed).
Many thanks.
====== ======
Given the pattern and a String str:
if (!str.replaceAll(pattern, "").equals(str)) {
    // str contains at least one invalid XML character
}
;)
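The replaceAll comparison above builds a cleaned copy of the whole string just to compare it. A possible alternative, as a sketch only, is to compile the pattern once and ask Matcher.find() whether any invalid character occurs. The class name XmlCharDetector and the method containsInvalid are hypothetical names; the character class is the one from the quoted answer.

```java
import java.util.regex.Pattern;

class XmlCharDetector {
    // Same character class as in the quoted answer, compiled once and reused.
    private static final Pattern INVALID_XML10 = Pattern.compile("[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Hypothetical helper: true if str contains at least one invalid character.
    static boolean containsInvalid(String str) {
        return INVALID_XML10.matcher(str).find();
    }

    public static void main(String[] args) {
        System.out.println(containsInvalid("plain text"));    // false
        System.out.println(containsInvalid("bad\u0000text")); // true
    }
}
```

Pattern and Matcher have been in java.util.regex since JDK 1.4, so this should also be usable on JDK 1.5.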