Iterating over Unicode text character by character

Question

I have a sequence of Unicode code points. What I really need to do is iterate through them as a sequence of characters rather than a sequence of code points, and determine properties of each individual character, for example whether it is a letter.

For example, imagine that I wrote a Unicode-aware text field, and the user entered a character that consists of more than one code point, such as an "e" followed by a combining diacritic. I know that this particular character can also be represented as a single code point, and can be normalized to it, but I do not think that is possible in the general case. How can I implement backspace? Obviously it cannot just delete the last code point, because the user may have just entered a character that spans more than one code point.

How can I iterate over a sequence of Unicode code points as characters?

Edit: The break iterators offered by ICU seem to be pretty much what I need. However, I am not using ICU, so any reference on how to implement equivalent functionality myself would be an acceptable answer.

Another edit: It turns out that the Windows API does offer this functionality. MSDN is just not very good at listing all the string functions in one place. CharNext is the function I was looking for.

+7

c++ unicode character-properties

Puppy Nov 26 '11 at 10:05

2 answers

The UTF8-CPP project contains a number of clean, easy-to-read, STL-like algorithms for iterating over Unicode text code point by code point, character by character, and so on. You can study it for inspiration.
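To make the "code point by code point" step concrete, here is a minimal hand-rolled sketch of decoding UTF-8 into code points. It is not the UTF8-CPP API: `decode_utf8` is a hypothetical helper written for illustration, and it assumes that replacing malformed bytes with U+FFFD is acceptable. It does not reject overlong encodings or surrogates, which a real library would.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of UTF8-CPP): decode a UTF-8 byte string
// into a vector of code points. Malformed sequences become U+FFFD.
// Simplification: overlong forms and surrogate values are not rejected.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

Decoding the bytes of "e" followed by U+0301 yields two code points, which is exactly why code point iteration alone does not answer the question: the user sees one character.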

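Please note that the "character by character" approach may not be obvious. One easy way to do it is to iterate over a UTF-32 string in Normalization Form C. UTF-32 uses a fixed-length encoding for each code point, and NFC combines many base-plus-diacritic sequences into a single code point, although not every such sequence has a precomposed form.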
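For sequences that stay multi-code-point even after normalization, a backspace can treat trailing combining marks as part of the preceding base character. The following is a rough sketch under strong assumptions, not a full implementation of Unicode grapheme cluster rules: `is_combining_mark`, `last_character_start`, and `backspace` are hypothetical names, and the mark test only covers a handful of combining-mark blocks. Hangul jamo, emoji sequences, and many scripts need the full segmentation rules that ICU's break iterators implement.

```cpp
#include <cstddef>
#include <string>

// Hypothetical and deliberately incomplete: recognizes only a few
// combining-mark blocks. A real implementation would consult the
// Unicode character database instead of hard-coded ranges.
bool is_combining_mark(char32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F)   // Combining Diacritical Marks
        || (cp >= 0x1AB0 && cp <= 0x1AFF)   // ... Extended
        || (cp >= 0x1DC0 && cp <= 0x1DFF)   // ... Supplement
        || (cp >= 0x20D0 && cp <= 0x20FF)   // ... for Symbols
        || (cp >= 0xFE20 && cp <= 0xFE2F);  // Combining Half Marks
}

// Index at which the last user-perceived character starts, under the
// simplifying assumption "base code point plus trailing combining marks".
std::size_t last_character_start(const std::u32string& text) {
    std::size_t i = text.size();
    while (i > 0 && is_combining_mark(text[i - 1])) {
        --i;  // skip over trailing combining marks
    }
    if (i > 0) {
        --i;  // include the base code point itself
    }
    return i;
}

// Remove the last user-perceived character; does nothing on empty text.
void backspace(std::u32string& text) {
    text.erase(last_character_start(text));
}
```

With this sketch, calling `backspace` on the text "a", "e", U+0301 removes both the "e" and its accent and leaves "a", rather than leaving a bare "e" behind.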

+1

André Caron Nov 26 '11 at 10:11

bmargulies · Accepted Answer · 2011-11-26T22:07:21+0000

Use the ICU library.

http://site.icu-project.org/

For example:

http://icu-project.org/apiref/icu4c/classUnicodeString.html#ae3ffb6e15396dff152cb459ce4008f90

is a function that returns the character at a given offset in the string.
