Unicode Character Combining Algorithm

I intend to normalize text to form C (NFC) and then split it into "display units": basically a base glyph plus all the combining characters that follow it. For now, I only want to handle Latin scripts.

To determine whether a code point is a combining character, is it enough to check that it falls in one of these ranges?

  • Combining Diacritical Marks (0300-036F)
  • Combining Diacritical Marks Supplement (1DC0-1DFF)
  • Combining Diacritical Marks for Symbols (20D0-20FF)
  • Combining Half Marks (FE20-FE2F)

Arabic, Hebrew, and the various Indic scripts can wait ...
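
A minimal sketch of such a range check might look like this in Java. The class name `LatinCombiningRanges` and the method `inListedRanges` are hypothetical and exist only for illustration; the ranges are the four blocks listed above.

```java
final class LatinCombiningRanges {
    // Inclusive code point ranges of the four blocks listed in the question.
    private static final int[][] RANGES = {
        {0x0300, 0x036F}, // Combining Diacritical Marks
        {0x1DC0, 0x1DFF}, // Combining Diacritical Marks Supplement
        {0x20D0, 0x20FF}, // Combining Diacritical Marks for Symbols
        {0xFE20, 0xFE2F}, // Combining Half Marks
    };

    /** Returns true if the code point lies inside one of the listed blocks. */
    static boolean inListedRanges(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inListedRanges(0x0301)); // COMBINING ACUTE ACCENT
        System.out.println(inListedRanges('e'));    // a plain Latin letter
    }
}
```

A fixed block list like this accepts every code point inside those blocks and rejects combining marks that live elsewhere, such as U+0483 COMBINING CYRILLIC TITLO, so whether it is "enough" depends on how strictly the input is limited to Latin text.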

+4
unicode
Jun 11 '13 at 19:00
2 answers

These are all the Unicode code point ranges whose character names contain the word "COMBINING" (for example, U+0301 COMBINING ACUTE ACCENT):

300-36F
483-489
7EB-7F3
135F-135F
1A7F-1A7F
1B6B-1B73
1DC0-1DE6
1DFD-1DFF
20D0-20F0
2CEF-2CF1
2DE0-2DFF
3099-309A
A66F-A672
A67C-A67D
A6F0-A6F1
A8E0-A8F1
FE20-FE26
101FD-101FD
1D165-1D169
1D16D-1D172
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
1D242-1D244

I compiled this list with a Python script that uses the unicodedata module. I am not sure which version of Unicode it reflects, but I think it is reasonably up to date.
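
The original Python script is not shown. As a rough equivalent in Java, the sketch below scans every code point, asks `Character.getName` for its name, and collapses the hits into ranges. The class and method names are hypothetical, and the output depends on the Unicode version bundled with the JDK that runs it, so it may not match the list above exactly.

```java
import java.util.ArrayList;
import java.util.List;

final class CombiningNameScan {
    /** Returns inclusive [first, last] runs of code points whose name contains "COMBINING". */
    static List<int[]> combiningNameRanges() {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // first code point of the current run, or -1 when not inside a run
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            String name = Character.getName(cp); // null for unassigned code points
            boolean hit = name != null && name.contains("COMBINING");
            if (hit && start < 0) {
                start = cp;
            } else if (!hit && start >= 0) {
                ranges.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            // Close a run that extends to the very last code point.
            ranges.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : combiningNameRanges()) {
            System.out.println(String.format("%X-%X", r[0], r[1]));
        }
    }
}
```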

However, I don't know whether you want only characters that are "combining" in the strict sense of the word, since Unicode also has "modifier letters" and so on.
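
One way to make the "strict sense" concrete is to look at the general category instead of the name. The sketch below is an illustration, not part of the original answer, and its class and method names are hypothetical. It treats a code point as a mark when `Character.getType` reports one of the three mark categories (Mn, Mc, Me), and it shows that modifier letters (Lm) fall outside that set.

```java
final class MarkCategory {
    /** True if the code point has general category Mn, Mc, or Me. */
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** True if the code point has general category Lm (modifier letter). */
    static boolean isModifierLetter(int codePoint) {
        return Character.getType(codePoint) == Character.MODIFIER_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(isMark(0x0301));           // COMBINING ACUTE ACCENT
        System.out.println(isMark(0x0591));           // HEBREW ACCENT ETNAHTA, no "COMBINING" in its name
        System.out.println(isMark(0x02B0));           // MODIFIER LETTER SMALL H
        System.out.println(isModifierLetter(0x02B0));
    }
}
```

A category test also catches marks whose names do not contain "COMBINING", which a name-based scan would miss.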

+2
Jun 11 '13 at 20:06

Well, I recently hacked together something like this. Enjoy!

  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public static List<String> stringToCharacterWithCombiningChars(String fullText) {
      // {M} is any kind of 'mark': http://stackoverflow.com/questions/29110887/detect-any-combining-character-in-java/29111105
      // Each match is either a run of marks with no base, or one non-mark followed by its marks.
      Pattern splitWithCombiningChars = Pattern.compile("(\\p{M}+|\\P{M}\\p{M}*)");
      Matcher matcher = splitWithCombiningChars.matcher(fullText);
      ArrayList<String> outGoing = new ArrayList<>();
      while (matcher.find()) {
          outGoing.add(matcher.group());
      }
      return outGoing;
  }

Associated (passing) unit test, in case it is useful to others: https://gist.github.com/rdp/0014de502f37abd64ffd
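
To tie this back to the question, here is a hedged sketch that first normalizes to NFC with `java.text.Normalizer` and then applies the same mark-based pattern. The class name `DisplayUnits` and the method `split` are hypothetical; this is one possible arrangement, not the code from the linked gist.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DisplayUnits {
    // A run of marks with no base, or one non-mark followed by any marks.
    private static final Pattern UNIT = Pattern.compile("\\p{M}+|\\P{M}\\p{M}*");

    /** Normalizes the text to NFC, then splits it into base-plus-marks units. */
    static List<String> split(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        List<String> units = new ArrayList<>();
        Matcher matcher = UNIT.matcher(nfc);
        while (matcher.find()) {
            units.add(matcher.group());
        }
        return units;
    }

    public static void main(String[] args) {
        // 'e' + acute composes to a single code point under NFC;
        // 'q' + acute has no precomposed form, so it stays as two code points in one unit.
        List<String> units = split("e\u0301q\u0301");
        System.out.println(units.size());
    }
}
```

Normalizing first means that sequences with a precomposed form collapse to one code point, so the pattern only has to group the marks that NFC could not compose.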

+1
Mar 18 '15 at 21:40


