How to read characters in a unicode string in C

Let's say I have a string:

char theString[] = "你们好āa"; 

Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with a macron is two bytes, and "a" is one byte):

 strlen(theString) == 12 

How do I count the number of characters? And how can I get the equivalent of subscripting, such that:

 theString[3] == "好" 

How can I slice and concatenate such strings?

+55
c string unicode ascii
Sep 04 '11 at 8:15
10 answers

You only count the characters that have the top two bits not set to 10 (i.e., everything below 0x80 or above 0xbf).

This is because all bytes with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work with a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules: any byte starting with a 0 bit or a 11 pattern is the start of a UTF-8 code point; all the rest are continuation bytes.

Your best bet, if you don't want to use a third-party library, is simply to provide functions along the lines of:

utf8left(char *destbuff, char *srcbuff, size_t sz);
utf8mid(char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest(char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

  • the leftmost sz UTF-8 bytes of a string;
  • the sz UTF-8 bytes of a string, starting at pos;
  • the rest of the UTF-8 bytes of a string, starting at pos.

Those will be decent building blocks, allowing you to manipulate the strings sufficiently for your purposes.
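A minimal sketch of what utf8left might look like, under one plausible reading of the spec above: sz is a byte budget, and the function refuses to split a code point in half (the boundary-scanning logic here is my assumption, not a tested implementation):

#include <string.h>

/* Copy at most sz bytes of srcbuff into destbuff without cutting
   a UTF-8 sequence in half. Assumes destbuff can hold sz bytes plus
   a terminator, and that srcbuff is valid UTF-8. */
void utf8left(char *destbuff, char *srcbuff, size_t sz)
{
    size_t i, cut = 0;
    /* find the last code point boundary at or before sz bytes */
    for (i = 0; i <= sz; ++i) {
        if ((srcbuff[i] & 0xC0) != 0x80)  /* lead byte or NUL: a boundary */
            cut = i;
        if (!srcbuff[i])
            break;
    }
    memcpy(destbuff, srcbuff, cut);
    destbuff[cut] = '\0';
}

utf8mid and utf8rest would follow the same pattern, first scanning forward to the boundary at or after pos.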

+26
Sep 04 '11 at 8:45

The easiest way is to use a library like ICU.
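For example, ICU's plain C API can convert the bytes to its native UTF-16 form and count code points (a sketch; the buffer size and error handling are kept minimal here, and you would link against ICU's common library, e.g. -licuuc):

#include <stdio.h>
#include <unicode/ustring.h>

int main(void)
{
    const char *utf8 = "你们好āa";
    UChar buf[64];
    int32_t len = 0;
    UErrorCode err = U_ZERO_ERROR;

    /* convert UTF-8 to ICU's UTF-16 representation */
    u_strFromUTF8(buf, 64, &len, utf8, -1, &err);
    if (U_FAILURE(err)) return 1;

    /* count code points, not UTF-16 units and not bytes */
    printf("%d\n", (int)u_countChar32(buf, len));  /* prints 5 */
    return 0;
}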

+17
Sep 04 '11 at 8:27

Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s)
        if ((*s & 0xC0) != 0x80)
            ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 code point
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80)
            --pos;
        if (pos == 0)
            return s;
    }
    return NULL;
}

// converts code point indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    p[len] = '\0';  // terminate the buffer; the utf8 functions expect a C string

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3;
    end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}

Example run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

Note that your example has an off-by-one error: theString[2] == "好"

+12
Sep 04 '11

Depending on your concept of “character,” this question may be more or less involved.

First, you must convert your byte string into a string of Unicode code points. You can do this with iconv() or with ICU, but if that's the only thing you need, iconv() is a lot simpler, and it's part of POSIX.

Your Unicode code point string can be something like a zero-terminated uint32_t[] or, if you have C1x, an array of char32_t. The size of that array (i.e., its element count, not its size in bytes) is the number of code points (plus the terminator), and that should give you a very good start.
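A sketch of that conversion with iconv() (the "UTF-32LE" encoding name assumes a little-endian host; the fixed 64-element buffer and minimal error handling are simplifications):

#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "你们好āa";
    uint32_t out[64] = {0};              /* zero-terminated code point string */
    char *inp = in, *outp = (char *)out;
    size_t inleft = strlen(in), outleft = sizeof out - sizeof (uint32_t);

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1) return 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) return 1;
    iconv_close(cd);

    size_t n = 0;
    while (out[n]) ++n;                  /* element count == code point count */
    printf("%zu code points\n", n);      /* prints 5 */
    return 0;
}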

However, the notion of a “printable character” is fairly complex, and you may prefer to count graphemes rather than code points. For example, an a with an accent ^ can be expressed as two Unicode code points, or as a combined legacy code point â; both are valid, and both are required by the Unicode standard to be treated identically. There is a process called “normalization” that turns your string into one definite form, but there are many graphemes that cannot be expressed as a single code point, and in general there is no way around a proper library that understands this and counts graphemes for you.
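To make the â example concrete: the two forms are different byte sequences, so plain byte comparison sees them as different strings (the hex values below are the standard UTF-8 encodings of U+00E2, and of U+0061 followed by U+0302):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char precomposed[] = "\xC3\xA2";   /* U+00E2: â as one code point */
    char decomposed[]  = "a\xCC\x82";  /* U+0061 + U+0302 combining circumflex */
    printf("%zu vs %zu bytes, strcmp = %d\n",
           strlen(precomposed), strlen(decomposed),
           strcmp(precomposed, decomposed));  /* different bytes, same grapheme */
    return 0;
}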

Having said that, it's up to you how complex the scripts you deal with are and how thoroughly you want to handle them. Converting to Unicode code points is a must; everything beyond that is your call.

Feel free to ask questions about ICU if you decide you need it, but feel free to explore the much simpler iconv() first.

+8
04 Sep '11 at 10:27
source share

In general, we should use a different data type for Unicode characters.

For example, you can use the wide character data type:

 wchar_t theString[] = L"你们好āa"; 

Notice the L modifier, which indicates that the string consists of wide characters.

The length of this string can be calculated using the wcslen function, which behaves like strlen .
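For instance (a sketch; this assumes a platform where one wchar_t element holds a whole code point, such as Linux with its 4-byte wchar_t):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t theString[] = L"你们好āa";
    /* wcslen counts wchar_t elements, not bytes */
    printf("%zu\n", wcslen(theString));  /* prints 5 here */
    return 0;
}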

+2
Sep 04 '11

In the real world, theString[3]=foo; is simply not a meaningful operation. Why would you ever want to replace the character at a particular position in a string with some other character? There's hardly any text-processing task for which this operation makes sense.

Character counting is also rarely meaningful. How many characters (by your idea of “character”) are there in "á"? What about "á"? Now, what about "གི"? If you need this information to implement some kind of text editing, you'll have to deal with these hard questions, or just use an existing GUI toolkit library. I would recommend the latter unless you're a specialist in world scripts and languages and think you can do better.

For all other purposes, strlen tells you exactly the piece of information that is actually useful: how much storage space a string takes. This is what's needed for combining and splitting strings. If all you want to do is combine strings or split them at a particular delimiter, snprintf (or strcat, if you insist...) and strstr are all you need.
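For example, byte-oriented combining and splitting work unchanged on UTF-8, since a valid multi-byte sequence can never occur inside another one (a minimal sketch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%s%s", "你们好", "āa");  /* combine */
    char *sep = strstr(buf, "ā");                        /* split at a delimiter */
    if (sep)
        printf("before the delimiter: %.*s\n", (int)(sep - buf), buf);
    return 0;
}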

If you want to perform higher-level natural-language operations, such as capitalization, line breaking, etc., or even higher-level operations such as pluralization, tense change, etc., then you'll need a library like ICU, or, respectively, something much more high-level and linguistically capable (and specific to the language(s) you're working with).

Again, most programs have no use for this kind of thing, and just need to assemble and parse text without any natural-language considerations.

+2
Sep 04 '11
#include <stddef.h>

size_t utf8_strlen(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)  /* count only lead bytes, skip continuations */
            j++;
        i++;
    }
    return j;
}

This will count the characters in a UTF-8 string... (found in this article: Even faster counting of UTF-8 characters)

However, am I still at a dead end for slicing and concatenating?!?

+1
Sep 04 '11 at 8:27

One thing that's not clear from the above answers is why this isn't easy. Each character is encoded in one way or another (it doesn't have to be UTF-8, for example), and each character may have multiple encodings, with various ways of handling combining accents and so on. The rules are really complicated and vary by encoding (e.g., UTF-8 vs. UTF-16).

This question has huge security implications, so it's important to get it right. Use an OS-supplied library or a well-known third-party library for manipulating Unicode strings; don't roll your own.

+1
Sep 04 '11

I did a similar implementation years ago, but I don't have the code with me.

For each Unicode character, the first byte describes how many bytes follow it to make up the encoded character. Based on the first byte, you can determine the length of each one.
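For instance, the lead byte's high bits give the sequence length directly (a common sketch; validation of the continuation bytes themselves is omitted):

/* Length in bytes of a UTF-8 sequence, from its lead byte.
   Returns 0 for a continuation or invalid lead byte. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}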

I think this is a good UTF-8 library: enter the link here

0
Sep 06 '11 at 17:36

A sequence of code points may constitute a single syllable/letter/symbol in many non-European languages (for example, all Indic languages).

So, when you compute the length or find a substring (there are definitely use cases for finding substrings; say, playing a game of hangman), you need to advance syllable by syllable, not code point by code point.

Thus, the definition of a character/syllable, and where you actually break the string into “syllable chunks”, depends on the nature of the language you're dealing with. For example, the syllable pattern in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following:

V              (vowel in its primary form, appearing at the beginning of a word)
C              (consonant)
C + V          (consonant + vowel in its secondary form)
C + C + V
C + C + C + V

You need to parse the string, looking for the patterns above, in order to break it up and find the substrings.

I don't think you can have a general-purpose method that magically breaks strings up in the manner described above for any Unicode string (or sequence of code points), since a pattern that works for one language may not apply to another script.

I suppose there may be methods/libraries that can take some definition/configuration parameters as input to break Unicode strings into such syllable chunks. Not sure, though! I'd appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.

-1
Oct 20


