Java Regular Expression Match Characters Outside the Basic Multilingual Plane

How can I match characters outside the Unicode Basic Multilingual Plane in Java, with the goal of removing them?

+15
java regex unicode astral-plane
Oct 27 2018-10-27
2 answers

To remove all non-BMP characters, the following should work:

String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", ""); 
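As a minimal sketch of how the one-liner above might be wrapped for reuse (the class name `NonBmpRegex` and method name `stripNonBmp` are hypothetical, not part of the original answer): it relies on `java.util.regex` comparing by code point, so a surrogate pair is treated as a single character above U+FFFF. The escapes are written with a doubled backslash so that the regex engine, rather than the Java compiler, interprets them.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class NonBmpRegex {
    // Matches any single code point outside the range U+0000..U+FFFF.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Returns the input with every supplementary-plane code point removed.
    static String stripNonBmp(String input) {
        return NON_BMP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP and is stored as a surrogate pair.
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(stripNonBmp("a" + astral + "b")); // prints "ab"
    }
}
```

Precompiling the pattern avoids recompiling it on every call, which `String.replaceAll` would otherwise do.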
+19
Oct 27 '10 at 17:19

Are you looking for specific characters, or for all characters outside the BMP?

If the former, you can use a StringBuilder to build a pattern string containing code points from the supplementary planes, and the regular expression will work as expected:

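  // Requires java.util.regex.Pattern and java.util.regex.Matcher.
  // Build a test string with U+10300 (a supplementary code point) in the middle.
  String test = new StringBuilder()
      .append("test")
      .appendCodePoint(0x10300)
      .append("test")
      .toString();
  // The pattern is that same code point as a literal string.
  Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());
  Matcher matcher = regex.matcher(test);
  matcher.find();
  System.out.println(matcher.start()); // prints 4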
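A small sketch along the same lines, assuming you want to locate one specific supplementary code point: the class `CodePointFinder` and its `find` method are hypothetical names introduced for illustration. It also shows that match offsets are counted in UTF-16 `char` units, so a single supplementary character spans two index positions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class CodePointFinder {
    // Returns {start, end} of the first occurrence of codePoint in text,
    // measured in char (UTF-16 code unit) offsets, or null if it is absent.
    static int[] find(String text, int codePoint) {
        String literal = new String(Character.toChars(codePoint));
        // Pattern.quote keeps the literal from being read as regex syntax.
        Matcher matcher = Pattern.compile(Pattern.quote(literal)).matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start(), matcher.end() };
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test").appendCodePoint(0x10300).append("test").toString();
        int[] span = find(test, 0x10300);
        // The match starts at 4 and ends at 6: one code point, two chars.
        System.out.println(span[0] + ".." + span[1]); // prints 4..6
    }
}
```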

If you want to remove all non-BMP characters from a string, I would build the result with a StringBuilder directly rather than use a regular expression:

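  StringBuilder sb = new StringBuilder(test.length());
  for (int ii = 0; ii < test.length(); ) {
      int codePoint = test.codePointAt(ii);
      if (codePoint > 0xFFFF) {
          // Supplementary code point: skip both halves of the surrogate pair.
          ii += Character.charCount(codePoint);
      } else {
          // BMP code point: keep it and advance by one char.
          sb.appendCodePoint(codePoint);
          ii++;
      }
  }
  String result = sb.toString();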
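On Java 8 or later, the same filtering could also be sketched with the code-point stream returned by `String.codePoints()`. This is an alternative to the loop above, not the answer's own method, and the class name `NonBmpFilter` and method name `stripNonBmp` are hypothetical.

```java
// Hypothetical helper; the names are illustrative only.
final class NonBmpFilter {
    // Keeps only code points in the Basic Multilingual Plane (U+0000..U+FFFF).
    static String stripNonBmp(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String test = new StringBuilder()
                .append("test").appendCodePoint(0x10300).append("test").toString();
        System.out.println(stripNonBmp(test)); // prints "testtest"
    }
}
```

As with the loop, an unpaired surrogate is reported as its own code point below U+10000, so it is kept rather than removed.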
+3
Oct 27 '10 at 17:10