Will strcmp compare UTF-8 strings in code point order?

In a C program, I want to sort a list of valid UTF-8 encoded strings in Unicode code point order. No collation, no language awareness.

I need a comparison function. It's easy enough to write one that iterates over the Unicode characters. (I use GLib, so I would iterate with g_utf8_next_char and compare the return values of g_utf8_get_char.)
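For illustration, here is a minimal sketch of that character-by-character comparison in plain C rather than GLib, so that it stands alone. The names utf8_decode and utf8_cmp_codepoints are hypothetical, and the decoder assumes the input is already valid, NUL-terminated UTF-8 (it does no validation).

```c
#include <stdint.h>

/* Decode one code point from valid UTF-8 at *p and advance *p past it.
   Assumption: input is well-formed; nothing is validated here. */
static uint32_t utf8_decode(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t c = *s++;
    int extra = 0;

    if (c >= 0xF0)      { c &= 0x07; extra = 3; }  /* 4-byte sequence */
    else if (c >= 0xE0) { c &= 0x0F; extra = 2; }  /* 3-byte sequence */
    else if (c >= 0xC0) { c &= 0x1F; extra = 1; }  /* 2-byte sequence */

    while (extra-- > 0)
        c = (c << 6) | (*s++ & 0x3F);              /* append 6 payload bits */

    *p = s;
    return c;
}

/* Hypothetical helper: compare two valid UTF-8 strings by code point value.
   Returns a negative, zero, or positive value, like strcmp. */
int utf8_cmp_codepoints(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    for (;;) {
        uint32_t ca = utf8_decode(&pa);
        uint32_t cb = utf8_decode(&pb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)                               /* both strings ended */
            return 0;
    }
}
```

With GLib, the decoding step would presumably be replaced by g_utf8_get_char and the pointer advance by g_utf8_next_char.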

But what interests me, out of curiosity and perhaps for simplicity and efficiency, is this: will a plain byte-by-byte strcmp (or g_strcmp0) do the same job? I think it should, since UTF-8 encodes the most significant bits first, and a code point that needs N + 1 bytes will have a larger lead byte than a code point that needs only N bytes.

But maybe I missed something? Thanks in advance.

c unicode utf-8 glib
1 answer

Yes, UTF-8 preserves code point order, so you can just use strcmp. This is one of the (many) nice properties of UTF-8.
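As a rough sketch of what that looks like in practice, the following sorts an array of strings with qsort and a strcmp-based callback. The names cmp_utf8_bytes and sort_utf8 are made up for this example, and the strings are assumed to be valid, NUL-terminated UTF-8. It relies on strcmp comparing bytes as unsigned char, which the C standard specifies.

```c
#include <stdlib.h>
#include <string.h>

/* qsort callback: each element is a pointer to a string; compare bytewise. */
static int cmp_utf8_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb);
}

/* Hypothetical helper: sort an array of valid UTF-8 strings in place.
   Byte order and code point order coincide for valid UTF-8. */
void sort_utf8(const char **strings, size_t n)
{
    qsort(strings, n, sizeof *strings, cmp_utf8_bytes);
}
```

For example, sorting "z", "é" (U+00E9), U+FFFD and U+1F600 this way puts them in exactly that order, because their lead bytes are 0x7A, 0xC3, 0xEF and 0xF0.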

One caveat: Unicode code points are UTF-32 values, but some people who talk about comparing Unicode strings in “code point” order actually use “code point” to mean a UTF-16 “code unit”. If you want the order to match UTF-16 code unit order, a little more work is involved.
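To give an idea of that extra work, here is one possible sketch (not the only approach). The two orders differ only when a supplementary code point (U+10000 and above, UTF-8 lead byte 0xF0 or higher) meets a code point in U+E000..U+FFFF (lead byte 0xEE or 0xEF): in UTF-16 the former is stored as surrogates in 0xD800..0xDFFF and therefore sorts first. The name utf8_cmp_utf16_order is hypothetical, and valid, NUL-terminated UTF-8 input is assumed.

```c
/* Hypothetical helper: compare two valid UTF-8 strings so that the result
   matches UTF-16 code unit order. Returns -1, 0, or 1. */
int utf8_cmp_utf16_order(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    /* Skip the common prefix. */
    while (*pa != 0 && *pa == *pb) {
        pa++;
        pb++;
    }

    unsigned ca = *pa, cb = *pb;
    if (ca == cb)
        return 0;                       /* both strings ended together */

    /* Bytes >= 0xC0 are always lead bytes, so these tests only fire when
       the strings first differ at the start of a code point. */
    int a_supp = ca >= 0xF0;                    /* U+10000 and above   */
    int b_supp = cb >= 0xF0;
    int a_high = (ca == 0xEE || ca == 0xEF);    /* U+E000..U+FFFF      */
    int b_high = (cb == 0xEE || cb == 0xEF);

    if (a_supp && b_high)
        return -1;                      /* surrogates sort below 0xE000 */
    if (a_high && b_supp)
        return 1;

    return ca < cb ? -1 : 1;            /* otherwise plain byte order   */
}
```

In every other case the result is the same as plain byte order, so this only matters if the data contains both private-use or other high BMP characters and supplementary characters.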
