Is there a strlen() that works with char16_t?

As the title says:

    typedef __CHAR16_TYPE__ char16_t;

    int main(void)
    {
        static char16_t test[] = u"Hello World!\n";

        printf("Length = %d", strlen(test)); // strlen equivalent for char16_t ???

        return 0;
    }

I searched and found only C++ solutions.

My compiler is GCC 4.7.

Edit:

To clarify, I am looking for a solution that returns the number of char16_t code units, not the number of characters.

These two counts differ for UTF-16 strings that contain characters outside the BMP.

+7
4 answers

Here is a basic strlen for char16_t:

    int strlen16(const char16_t* strarg)
    {
        int count = 0;
        if(!strarg)
            return -1; //strarg is NULL pointer
        const char16_t* str = strarg; // keep const: the string is only read
        while(*str)
        {
            count++;
            str++;
        }
        return count;
    }

Here is a more efficient and more common version:

    int strlen16(const char16_t* strarg)
    {
        if(!strarg)
            return -1; //strarg is NULL pointer
        const char16_t* str = strarg; // keep const: the string is only read
        for(;*str;++str)
            ; // empty body
        return (int)(str-strarg); // pointer difference = number of elements
    }

Hope this helps.

Warning: these functions count 16-bit code units, so they do not give the number of characters (code points) in a UTF-16 string. This matters whenever __STDC_UTF_16__ is defined as 1, because char16_t strings are then encoded as UTF-16.

UTF-16 is a variable-length encoding (2 bytes per character in the BMP, 4 bytes per character outside it), and these functions do not account for that.
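
To make the difference concrete, here is a minimal sketch (not part of the original answer) that counts code points rather than code units. The name utf16_count_code_points is hypothetical. The sketch assumes the string really is UTF-16 (__STDC_UTF_16__ is 1) and counts an unpaired surrogate as one code point.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t (C11) */

/* Hypothetical helper: counts Unicode code points in a null-terminated
 * UTF-16 string. A high surrogate (0xD800-0xDBFF) immediately followed by
 * a low surrogate (0xDC00-0xDFFF) is counted once. Returns 0 for NULL. */
size_t utf16_count_code_points(const char16_t *s)
{
    size_t count = 0;

    if (!s)
        return 0;

    while (*s) {
        /* s[1] is safe to read: *s is non-zero, so s[1] is at worst
         * the terminator. */
        int is_pair = s[0] >= 0xD800 && s[0] <= 0xDBFF &&
                      s[1] >= 0xDC00 && s[1] <= 0xDFFF;

        s += is_pair ? 2 : 1;   /* skip both halves of a surrogate pair */
        count++;
    }
    return count;
}
```

For u"a\U0001F600b" this returns 3, while a plain code-unit loop such as strlen16 returns 4, because U+1F600 lies outside the BMP and takes two code units.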

+4
    #include <string.h>
    #include <wchar.h>
    #include <uchar.h>

    #define char8_t char
    #define strlen8 strlen
    #define strlen16 strlen16
    #define strlen32(s) wcslen((const wchar_t*)s) /* assumes a 32-bit wchar_t */

    static inline size_t strlen16(register const char16_t * string) {
        if (!string) return 0;
        register size_t len = 0;
        while(string[len++]); /* post-increment: the terminator is counted too */
        return len;
    }

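The result is a count of char16_t elements, not a count of bytes. Because of the post-increment in the loop condition, the count as written includes the terminating null element, so an empty string yields 1.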

Optimized build for a 32-bit Intel Atom target:

gcc -Wpedantic -std=iso9899:2011 -g3 -O2 -MMD -faggressive-loop-optimizations -fkeep-inline-functions -march=atom -mtune=atom -fomit-frame-pointer -mssse3 -mieee-fp -mfpmath=sse -fexcess-precision=fast -mpush-args -mhard-float -fPIC ...

    .Ltext0:
        .p2align 4,,15
        .type strlen16, @function
    strlen16:
    .LFB20:
        .cfi_startproc
    .LVL0:
        mov edx, DWORD PTR 4[esp]
        xor eax, eax
        test edx, edx
        je .L4
        .p2align 4,,15
    .L3:
    .LVL1:
        lea eax, 1[eax]
    .LVL2:
        cmp WORD PTR -2[edx+eax*2], 0
        jne .L3
        ret
    .LVL3:
        .p2align 4,,7
        .p2align 3
    .L4:
        ret
        .cfi_endproc
    .LFE20:
        .size strlen16, .-strlen16

Here is the disassembly in Intel syntax:

    static inline size_t strlen16(register const char16_t * string) {
       0:   8b 54 24 04             mov    edx,DWORD PTR [esp+0x4]
        if (!string) return 0;
       4:   31 c0                   xor    eax,eax
       6:   85 d2                   test   edx,edx
       8:   74 16                   je     20 <strlen16+0x20>
       a:   8d b6 00 00 00 00       lea    esi,[esi+0x0]
        register size_t len = 0;
        while(string[len++]);
      10:   8d 40 01                lea    eax,[eax+0x1]
      13:   66 83 7c 42 fe 00       cmp    WORD PTR [edx+eax*2-0x2],0x0
      19:   75 f5                   jne    10 <strlen16+0x10>
      1b:   c3                      ret
      1c:   8d 74 26 00             lea    esi,[esi+eiz*1+0x0]
        return len;
    }
      20:   c3                      ret
      21:   eb 0d                   jmp    30 <AnonymousFunction0>
      23:   90                      nop
      24:   90                      nop
      25:   90                      nop
      26:   90                      nop
      27:   90                      nop
      28:   90                      nop
      29:   90                      nop
      2a:   90                      nop
      2b:   90                      nop
      2c:   90                      nop
      2d:   90                      nop
      2e:   90                      nop
      2f:   90                      nop
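
For comparison, here is a small sketch of a variant that stops at the terminator without counting it, together with a helper that converts the element count to bytes. The names strlen16_units and strlen16_bytes are hypothetical and are not part of the answer above.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t (C11) */

/* Hypothetical variant: number of char16_t elements before the terminator.
 * Returns 0 for a NULL pointer. */
size_t strlen16_units(const char16_t *s)
{
    size_t len = 0;

    if (!s)
        return 0;
    while (s[len])      /* stop at the terminator without counting it */
        len++;
    return len;
}

/* Size in bytes of the same span, terminator excluded. */
size_t strlen16_bytes(const char16_t *s)
{
    return strlen16_units(s) * sizeof(char16_t);
}
```

For u"Hello World!\n", strlen16_units returns 13, and strlen16_bytes returns 13 * sizeof(char16_t), which is 26 where char16_t is two bytes wide.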
+2

You need to read two bytes at a time and check whether both are zero, because in UTF-16 one byte of a code unit can be zero without marking the end of the string.

Not an ideal solution (actually a rather odd one):

    size_t strlen16(const char16_t* str16)
    {
        size_t result = 0;
        const char* strptr = (const char*) str16; // view the string as raw bytes
        char byte0, byte1;

        if(str16 == NULL)
            return result;

        byte0 = *strptr;
        byte1 = *(strptr + 1);
        while(byte0|byte1) // stop only when both bytes of a unit are zero
        {
            strptr += 2;
            byte0 = *strptr;
            byte1 = *(strptr + 1);
            result++;
        }
        return result;
    }
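
The byte-wise check gives the same answer as comparing the whole 16-bit value with zero, which is why the simpler loops work. The following sketch illustrates this with two hypothetical helpers. It assumes char16_t has no padding bits, which holds on common platforms.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>   /* char16_t (C11) */

/* Hypothetical helper: returns 1 if every byte of the code unit is zero. */
int unit_is_zero_bytewise(char16_t unit)
{
    unsigned char bytes[sizeof unit];
    size_t i;

    memcpy(bytes, &unit, sizeof unit);   /* inspect the raw bytes */
    for (i = 0; i < sizeof unit; i++)
        if (bytes[i])
            return 0;
    return 1;
}

/* Hypothetical helper: the same test done on the value itself. */
int unit_is_zero_direct(char16_t unit)
{
    return unit == 0;
}
```

A code unit such as 0x0041 ('A') or 0x0100 has one zero byte, yet neither helper treats it as a terminator; only 0x0000 does.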
0

On Windows there is wcslen(), because wchar_t is 16 bits wide there.
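
Since wcslen() only fits char16_t data where wchar_t is 16 bits wide, a portable wrapper needs a fallback. Here is a minimal sketch. The name strlen16_via_wcslen and the WCHAR_MAX test are assumptions of this sketch, and the pointer cast assumes char16_t and wchar_t share the same representation on such platforms.

```c
#include <stddef.h>
#include <stdint.h>  /* WCHAR_MAX */
#include <uchar.h>   /* char16_t (C11) */
#include <wchar.h>   /* wcslen */

/* Hypothetical wrapper: uses wcslen() where wchar_t is 16 bits wide
 * (as on Windows) and a plain loop everywhere else.
 * Returns 0 for a NULL pointer. */
size_t strlen16_via_wcslen(const char16_t *s)
{
    if (!s)
        return 0;
#if WCHAR_MAX == 0xFFFF
    /* Assumption: char16_t and wchar_t have the same representation. */
    return wcslen((const wchar_t *)s);
#else
    {
        const char16_t *p = s;

        while (*p)
            ++p;
        return (size_t)(p - s);
    }
#endif
}
```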

Regardless of the platform, it is best not to use char16_t. I believe that it is a mistake on the part of the standard committee to have it in the language.

0