How to read UTF-8 characters with a pointer?

Suppose I have UTF-8 content stored in memory. How do I read characters with a pointer? I suppose I need to check the 8th bit, which indicates a multi-byte character, but how exactly do I turn the sequence into a real Unicode character? Also, is wchar_t the correct type for storing a single Unicode character?

Here is what I mean:

    wchar_t readNextChar (char*& p)
    {
        wchar_t unicodeChar;
        char ch = *p++;

        if ((ch & 128) != 0)
        {
            // This is a multi-byte character, what do I do now?
            // char chNext = *p++;
            // ... but how do I assemble the Unicode character?
            ...
        }
        ...
        return unicodeChar;
    }
5 answers

You have to decode the UTF-8 bit pattern into its unencoded UTF-32 representation. If you want the actual Unicode code point, you must use a 32-bit data type.

On Windows, wchar_t is NOT large enough, since it is only 16 bits. Instead, you should use unsigned int or unsigned long. Use wchar_t only when dealing with UTF-16 code units.

On other platforms, wchar_t is usually 32 bits. But when writing portable code, you should stay away from wchar_t unless absolutely necessary (e.g. std::wstring).

Try something like this:

    #define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))

    u_long readNextChar (char* &p)
    {
        // TODO: since UTF-8 is a variable-length
        // encoding, you should pass in the input
        // buffer actual byte length so that you
        // can determine if a malformed UTF-8
        // sequence would exceed the end of the buffer...

        u_char c1, c2, *ptr = (u_char*) p;
        u_long uc = 0;
        int seqlen;
        // int datalen = ... available length of p ...;

        /*
        if( datalen < 1 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
        */

        c1 = ptr[0];

        if( (c1 & 0x80) == 0 )
        {
            uc = (u_long) (c1 & 0x7F);
            seqlen = 1;
        }
        else if( (c1 & 0xE0) == 0xC0 )
        {
            uc = (u_long) (c1 & 0x1F);
            seqlen = 2;
        }
        else if( (c1 & 0xF0) == 0xE0 )
        {
            uc = (u_long) (c1 & 0x0F);
            seqlen = 3;
        }
        else if( (c1 & 0xF8) == 0xF0 )
        {
            uc = (u_long) (c1 & 0x07);
            seqlen = 4;
        }
        else
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }

        /*
        if( seqlen > datalen )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
        */

        for(int i = 1; i < seqlen; ++i)
        {
            c1 = ptr[i];

            if( (c1 & 0xC0) != 0x80 )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }
        }

        switch( seqlen )
        {
            case 2:
            {
                c1 = ptr[0];

                if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
                {
                    // malformed data, do something !!!
                    return (u_long) -1;
                }

                break;
            }

            case 3:
            {
                c1 = ptr[0];
                c2 = ptr[1];

                switch (c1)
                {
                    case 0xE0:
                        if (!IS_IN_RANGE(c2, 0xA0, 0xBF))
                        {
                            // malformed data, do something !!!
                            return (u_long) -1;
                        }
                        break;

                    case 0xED:
                        if (!IS_IN_RANGE(c2, 0x80, 0x9F))
                        {
                            // malformed data, do something !!!
                            return (u_long) -1;
                        }
                        break;

                    default:
                        if (!IS_IN_RANGE(c1, 0xE1, 0xEC) && !IS_IN_RANGE(c1, 0xEE, 0xEF))
                        {
                            // malformed data, do something !!!
                            return (u_long) -1;
                        }
                        break;
                }

                break;
            }

            case 4:
            {
                c1 = ptr[0];
                c2 = ptr[1];

                switch (c1)
                {
                    case 0xF0:
                        if (!IS_IN_RANGE(c2, 0x90, 0xBF))
                        {
                            // malformed data, do something !!!
                            return (u_long) -1;
                        }
                        break;

                    case 0xF4:
                        if (!IS_IN_RANGE(c2, 0x80, 0x8F))
                        {
                            // malformed data, do something !!!
                            return (u_long) -1;
                        }
                        break;

                    default:
                        if (!IS_IN_RANGE(c1, 0xF1, 0xF3))
                        {
                            // malformed data, do something !!!
                            return (u_long) -1;
                        }
                        break;
                }

                break;
            }
        }

        for(int i = 1; i < seqlen; ++i)
        {
            uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
        }

        p += seqlen;
        return uc;
    }

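Here is a quick macro that computes the length of a UTF-8 sequence from its first byte:

    #define UTF8_CHAR_LEN(byte) (((0xE5000000u >> (((byte) >> 3) & 0x1e)) & 3) + 1)

It tells you how many bytes the character occupies, which makes parsing easier. Pass the byte as an unsigned char. The macro does not validate anything: a stray continuation byte (0x80 to 0xBF) is reported as length 1.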


If you need to decode UTF-8, you need a UTF-8 parser. UTF-8 is a variable-length encoding (1 to 4 bytes per character), so you really need to write a parser that conforms to the standard: see Wikipedia, for example.

If you do not want to write your own parser, I suggest using a library. You will find one in glib, for example (I personally used Glib::ustring, from the C++ wrapper for glib), and in any good general-purpose library.

Edit:

I think C++0x will include UTF-8 support, but I'm not a specialist...

my2c


Also, is wchar_t the right type to hold a single Unicode character?

On Linux, yes. On Windows, wchar_t is a UTF-16 code unit, which is not necessarily a whole character.

The upcoming C++0x standard will provide the types char16_t and char32_t for representing UTF-16 and UTF-32.

If you are on a system where char32_t is unavailable and wchar_t is inadequate, use uint32_t to store Unicode characters.


This is my solution in pure ANSI C, including a unit test for corner cases.

Beware that int should be at least 32 bits. Otherwise, you must change the definition of codepoint .

    #include <assert.h>
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned char byte;
    typedef unsigned int codepoint;

    /**
     * Reads the next UTF-8-encoded character from the byte array ranging
     * from {@code *pstart} up to, but not including, {@code end}. If the
     * conversion succeeds, the {@code *pstart} iterator is advanced,
     * the codepoint is stored into {@code *pcp}, and the function returns
     * 0. Otherwise the conversion fails, {@code errno} is set to
     * {@code EILSEQ} and the function returns -1.
     */
    int
    from_utf8(const byte **pstart, const byte *end, codepoint *pcp)
    {
        size_t len, i;
        codepoint cp, min;
        const byte *buf;

        buf = *pstart;
        if (buf == end)
            goto error;

        if (buf[0] < 0x80) {
            len = 1;
            min = 0;
            cp = buf[0];
        } else if (buf[0] < 0xC0) {
            goto error;
        } else if (buf[0] < 0xE0) {
            len = 2;
            min = 1 << 7;
            cp = buf[0] & 0x1F;
        } else if (buf[0] < 0xF0) {
            len = 3;
            min = 1 << (5 + 6);
            cp = buf[0] & 0x0F;
        } else if (buf[0] < 0xF8) {
            len = 4;
            min = 1 << (4 + 6 + 6);
            cp = buf[0] & 0x07;
        } else {
            goto error;
        }

        if (buf + len > end)
            goto error;

        for (i = 1; i < len; i++) {
            if ((buf[i] & 0xC0) != 0x80)
                goto error;
            cp = (cp << 6) | (buf[i] & 0x3F);
        }

        if (cp < min)
            goto error;

        if (0xD800 <= cp && cp <= 0xDFFF)
            goto error;

        if (0x110000 <= cp)
            goto error;

        *pstart += len;
        *pcp = cp;
        return 0;

    error:
        errno = EILSEQ;
        return -1;
    }

    static void
    assert_valid(const byte **buf, const byte *end, codepoint expected)
    {
        codepoint cp;

        if (from_utf8(buf, end, &cp) == -1) {
            fprintf(stderr, "invalid unicode sequence for codepoint %u\n", expected);
            exit(EXIT_FAILURE);
        }

        if (cp != expected) {
            fprintf(stderr, "expected %u, got %u\n", expected, cp);
            exit(EXIT_FAILURE);
        }
    }

    static void
    assert_invalid(const char *name, const byte **buf, const byte *end)
    {
        const byte *p;
        codepoint cp;

        p = *buf + 1;
        if (from_utf8(&p, end, &cp) == 0) {
            fprintf(stderr, "unicode sequence \"%s\" unexpectedly converts to %#x.\n", name, cp);
            exit(EXIT_FAILURE);
        }
        *buf += (*buf)[0] + 1;
    }

    static const byte valid[] = {
        0x00,                   /* first ASCII */
        0x7F,                   /* last ASCII */
        0xC2, 0x80,             /* first two-byte */
        0xDF, 0xBF,             /* last two-byte */
        0xE0, 0xA0, 0x80,       /* first three-byte */
        0xED, 0x9F, 0xBF,       /* last before surrogates */
        0xEE, 0x80, 0x80,       /* first after surrogates */
        0xEF, 0xBF, 0xBF,       /* last three-byte */
        0xF0, 0x90, 0x80, 0x80, /* first four-byte */
        0xF4, 0x8F, 0xBF, 0xBF  /* last codepoint */
    };

    static const byte invalid[] = {
        1, 0x80,
        1, 0xC0,
        1, 0xC1,
        2, 0xC0, 0x80,
        2, 0xC2, 0x00,
        2, 0xC2, 0x7F,
        2, 0xC2, 0xC0,
        3, 0xE0, 0x80, 0x80,
        3, 0xE0, 0x9F, 0xBF,
        3, 0xED, 0xA0, 0x80,
        3, 0xED, 0xBF, 0xBF,
        4, 0xF0, 0x80, 0x80, 0x80,
        4, 0xF0, 0x8F, 0xBF, 0xBF,
        4, 0xF4, 0x90, 0x80, 0x80
    };

    int
    main()
    {
        const byte *p, *end;

        p = valid;
        end = valid + sizeof valid;
        assert_valid(&p, end, 0x000000);
        assert_valid(&p, end, 0x00007F);
        assert_valid(&p, end, 0x000080);
        assert_valid(&p, end, 0x0007FF);
        assert_valid(&p, end, 0x000800);
        assert_valid(&p, end, 0x00D7FF);
        assert_valid(&p, end, 0x00E000);
        assert_valid(&p, end, 0x00FFFF);
        assert_valid(&p, end, 0x010000);
        assert_valid(&p, end, 0x10FFFF);

        p = invalid;
        end = invalid + sizeof invalid;
        assert_invalid("80", &p, end);
        assert_invalid("C0", &p, end);
        assert_invalid("C1", &p, end);
        assert_invalid("C0 80", &p, end);
        assert_invalid("C2 00", &p, end);
        assert_invalid("C2 7F", &p, end);
        assert_invalid("C2 C0", &p, end);
        assert_invalid("E0 80 80", &p, end);
        assert_invalid("E0 9F BF", &p, end);
        assert_invalid("ED A0 80", &p, end);
        assert_invalid("ED BF BF", &p, end);
        assert_invalid("F0 80 80 80", &p, end);
        assert_invalid("F0 8F BF BF", &p, end);
        assert_invalid("F4 90 80 80", &p, end);

        return 0;
    }