Detect any combining character in Java

I am looking for a way to determine whether a character in a Java string is a "combining character" or not. For example,

String khmerCombiningVowel = new String(new byte[]{(byte) 0xe1,(byte) 0x9f,(byte) 0x80}, "UTF-8"); // unicode 17c0 

represents a Khmer combining vowel. I tried the "\\p{InCombiningDiacriticalMarks}" regex, but it doesn't seem to apply to this particular combining character. Alternatively, if there is an exhaustive list of all Unicode combining characters, could I build a regular expression from it?

+2
java regex unicode combining-marks
Mar 17 '15 at 10:25
1 answer

Unicode has several blocks of combining characters, so a check against a single block such as \p{InCombiningDiacriticalMarks} (U+0300 to U+036F) covers only some of them.
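To illustrate why the block-based regex from the question misses U+17C0, here is a minimal sketch that compares the block each code point belongs to. The class and method names (BlockCheck, inCombiningDiacriticalMarks, matchesBlockRegex) are made up for this example; only the java.lang.Character and regex calls are standard library.

```java
final class BlockCheck {
    private BlockCheck() {}

    // True only if the code point lies in the "Combining Diacritical Marks" block.
    static boolean inCombiningDiacriticalMarks(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS;
    }

    // Same check expressed with the block regex used in the question.
    static boolean matchesBlockRegex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.matches("\\p{InCombiningDiacriticalMarks}");
    }
}
```

Under this sketch, U+0301 (combining acute accent) passes both checks, while U+17C0 fails both because it lives in the Khmer block rather than in Combining Diacritical Marks.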

Java has a number of useful facilities for this; try:

    String codePointStr = new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // unicode 17c0
    System.out.println(codePointStr.matches("\\p{Mc}"));
    System.out.println(
        Character.COMBINING_SPACING_MARK == Character.getType(codePointStr.codePointAt(0)));

(prints true in both cases)

In this case, COMBINING_SPACING_MARK and the corresponding regular expression \p{Mc} (also written \p{gc=Mc}) both refer to the Unicode general category "Mark, Spacing Combining", which basically covers any character that combines with the preceding character and also adds width.

Another regular expression that may be useful is \p{M}, which matches any combining mark. If you prefer the Character.getType() constants, you get the same behavior by checking whether the type is COMBINING_SPACING_MARK, ENCLOSING_MARK, or NON_SPACING_MARK.

ENCLOSING_MARK covers enclosing marks, such as a circle, which also add width to the character they combine with.

NON_SPACING_MARK includes the combining diacritical marks used with Latin letters, etc. (characters that mostly go above or below and don't add any width to the character).
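The three categories above can be folded into one helper. This is only a sketch of that idea, not the one true way to do it: the class and method names (CombiningMarks, isCombiningMark, containsCombiningMark, containsCombiningMarkRegex) are invented for this example, and String.codePoints() assumes Java 8 or later.

```java
import java.util.regex.Pattern;

final class CombiningMarks {
    // \p{M} matches any code point whose general category is Mn, Mc, or Me.
    private static final Pattern MARK = Pattern.compile("\\p{M}");

    private CombiningMarks() {}

    // True if the code point is a non-spacing, spacing combining, or enclosing mark.
    static boolean isCombiningMark(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                return true;
            default:
                return false;
        }
    }

    // Walks the string by code point (not by char) so supplementary characters are handled.
    static boolean containsCombiningMark(String s) {
        return s.codePoints().anyMatch(CombiningMarks::isCombiningMark);
    }

    // Regex variant of the same check.
    static boolean containsCombiningMarkRegex(String s) {
        return MARK.matcher(s).find();
    }
}
```

For example, CombiningMarks.isCombiningMark(0x17C0) covers the Khmer vowel from the question, and the same call also accepts U+0301 (non-spacing) and U+20DD (enclosing), while a plain letter such as 'a' is rejected.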

+5
Mar 17 '15 at 22:42