Overview and Assumption
Matching characters in the astral planes (code points U+10000 to U+10FFFF) is an underappreciated feature of Java regex.
This answer mainly concerns Oracle's implementation (the reference implementation, which is also used in OpenJDK) for Java 6 and later.
Please test the code yourself if you are using GNU Classpath or Android, since they use their own implementations.
Behind the scenes
Assuming you are running your regex on Oracle's implementation, your regex
"([\ud800-\udbff\udc00-\udfff])"
is compiled as follows:
StartS. Start unanchored match (minLength=1)
java.util.regex.Pattern$GroupHead
Pattern.union. A ∪ B:
  Pattern.union. A ∪ B:
    Pattern.rangeFor. U+D800 <= codePoint <= U+10FC00.
    BitClass. Match any of these 1 character(s):
      [U+002D]
  SingleS. Match code point: U+DFFF LOW SURROGATES DFFF
java.util.regex.Pattern$GroupTail
java.util.regex.Pattern$LastNode
Node. Accept match
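The character class is parsed as \ud800-\udbff\udc00, -, \udfff. Since \udbff\udc00 forms a valid surrogate pair, it represents the code point U+10FC00. The class therefore matches any code point from U+D800 to U+10FC00, a literal hyphen, or U+DFFF, which is not what was intended.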
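As a quick check of this parse, here is a minimal sketch that tests a few inputs against the same pattern. The class and method names (MisparsedClassDemo, matchesWhole) are made up for illustration, and the behavior is assumed to be that of the Oracle/OpenJDK regex engine.

```java
import java.util.regex.Pattern;

final class MisparsedClassDemo {
    // Same pattern as above. In the string literal the chars DBFF and DC00
    // sit next to each other, so the regex compiler reads them as U+10FC00.
    static final Pattern MISPARSED =
            Pattern.compile("([\ud800-\udbff\udc00-\udfff])");

    // True if the whole input is matched by the pattern (one code point).
    static boolean matchesWhole(String s) {
        return MISPARSED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String u10FFFF = new String(Character.toChars(0x10FFFF));
        System.out.println(matchesWhole("-"));      // literal hyphen
        System.out.println(matchesWhole("\uE000")); // BMP char inside U+D800..U+10FC00
        System.out.println(matchesWhole(u10FFFF));  // astral char above U+10FC00
    }
}
```

If the parse is as described, the hyphen and U+E000 should match even though neither is a surrogate, while U+10FFFF should not match because it lies above U+10FC00.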
Wrong solution
It makes no sense to write:
"[\ud800-\udbff][\udc00-\udfff]"
Since Oracle's implementation matches by code point, and a valid surrogate pair is converted to a single code point before matching, the regex above cannot match anything: it looks for 2 consecutive lone surrogates that would form a valid pair, which is a contradiction.
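A minimal sketch to confirm this, assuming the Oracle/OpenJDK regex engine (TwoClassDemo and findsMatch are illustrative names, not part of any API):

```java
import java.util.regex.Pattern;

final class TwoClassDemo {
    // Two separate classes: one for high surrogates, one for low surrogates.
    static final Pattern TWO_CLASSES =
            Pattern.compile("[\ud800-\udbff][\udc00-\udfff]");

    // True if the pattern matches anywhere in the input.
    static boolean findsMatch(String s) {
        return TWO_CLASSES.matcher(s).find();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the valid surrogate pair D83D DE00.
        String text = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(findsMatch(text));
    }
}
```

Because the pair is read as the single code point U+1F600, which is in neither class, find() is expected to return false here.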
Solution
If you want to match and remove all code points above U+FFFF in the astral planes (each formed by a valid surrogate pair), plus lone surrogates (which cannot form a valid surrogate pair), you should write:
input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
This solution has been tested to work in Java 6 and 7 (Oracle implementation).
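For reference, a small self-contained sketch that wraps the call in a helper and runs it on a sample string. The helper name stripAstralAndLoneSurrogates and the sample input are my own choices for illustration.

```java
final class AstralStripper {
    // Hypothetical helper. The character class contains, in order:
    // D800 DC00 (U+10000), '-', DBFF DFFF (U+10FFFF), then D800 '-' DFFF.
    static String stripAstralAndLoneSurrogates(String input) {
        return input.replaceAll("[\ud800\udc00-\udbff\udfff\ud800-\udfff]", "");
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // valid surrogate pair
        String loneHigh = "\ud800";                            // lone high surrogate
        // Both the astral character and the lone surrogate should be removed.
        System.out.println(stripAstralAndLoneSurrogates("a" + emoji + "b" + loneHigh + "c"));
    }
}
```

With this input the call should leave only the BMP characters, i.e. "abc". Ordinary BMP text, including private-use characters such as U+E000, is left untouched.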
The regular expression above compiles to:
StartS. Start unanchored match (minLength=1)
Pattern.union. A ∪ B:
  Pattern.rangeFor. U+10000 <= codePoint <= U+10FFFF.
  Pattern.rangeFor. U+D800 <= codePoint <= U+DFFF.
java.util.regex.Pattern$LastNode
Node. Accept match
Note that I am specifying the characters with string literal Unicode escape sequences, not with escape sequences in regex syntax.
// Only works in Java 7 and above
input.replaceAll("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]", "");
Java 6 does not recognize surrogate pairs when they are specified in regex syntax, so the regex treats \\ud800 as a single character and tries to compile the range \\udc00-\\udbff, where it fails. We are lucky that it throws an exception for this input; otherwise, the error would go undetected. Java 7 parses this regex correctly and compiles it into the same structure as above.
From Java 7 onward, the syntax \x{h..h} is available for specifying characters beyond the BMP (Basic Multilingual Plane), and it is the recommended way to specify characters in the astral planes.
input.replaceAll("[\\x{10000}-\\x{10ffff}\ud800-\udfff]", "");
This regex also compiles into the same structure as above.
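To compare the three spellings side by side, here is a hedged sketch that assumes Java 7 or later (on Java 6 the second pattern would throw and the third is not supported). EscapeStyleDemo and strip are illustrative names only.

```java
import java.util.regex.Pattern;

final class EscapeStyleDemo {
    // String-literal escapes: the compiler puts the surrogate chars into the pattern.
    static final Pattern LITERAL =
            Pattern.compile("[\ud800\udc00-\udbff\udfff\ud800-\udfff]");
    // Regex-syntax escapes (double backslash): resolved by the regex compiler.
    static final Pattern REGEX_U =
            Pattern.compile("[\\ud800\\udc00-\\udbff\\udfff\\ud800-\\udfff]");
    // Regex-syntax code point escapes for the astral range.
    static final Pattern REGEX_X =
            Pattern.compile("[\\x{10000}-\\x{10ffff}\ud800-\udfff]");

    // Removes every match of the given pattern from the input.
    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // One astral character (U+10FFFF) and one lone low surrogate (DFFF).
        String input = "x" + new String(Character.toChars(0x10FFFF)) + "y\udfffz";
        System.out.println(strip(LITERAL, input));
        System.out.println(strip(REGEX_U, input));
        System.out.println(strip(REGEX_X, input));
    }
}
```

If the three patterns really compile to the same structure, all three calls should produce the same result, "xyz" for this input.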