It is well known that UTF-8 is best suited for file storage and network transport, but people keep debating whether UTF-16 or UTF-32 is better for in-memory processing. One of the main counter-arguments is that UTF-16 is still variable length, and even UTF-32 still does not give you one code point per user-perceived character (combining marks mean one visible character can span several code points), so how are they better than UTF-8? My opinion is this: UTF-16 is a very good compromise.
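As a quick illustration (a minimal Python sketch with sample strings of my own choosing, not from the original discussion): a combining accent makes the code-point count differ from what a reader sees, and a character outside the BMP still needs two UTF-16 code units:

    cafe = "cafe\u0301"                         # "café" written with a combining acute accent
    print(len(cafe))                            # 5 code points, though a reader sees 4 characters
    hwair = "\U00010348"                        # U+10348 GOTHIC LETTER HWAIR, outside the BMP
    print(len(hwair.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
    print(len(hwair.encode("utf-32-le")) // 4)  # 1 UTF-32 unit, but one code point is still not always one character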
Firstly, the characters outside the BMP, which need two UTF-16 code units (a surrogate pair), are rarely used. The Chinese characters (and some other Asian symbols) in that range are mostly historical: ordinary people never type them, and they matter mainly to experts digitizing ancient books. So UTF-32 wastes space most of the time. Do not worry too much about these characters; failing to handle them perfectly will not make your software look bad, unless your software is aimed at those special users.
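A small check (Python, with hypothetical sample characters) makes the point concrete: an everyday ideograph sits in the BMP and takes one UTF-16 code unit, while a rare ideograph from CJK Extension B needs a surrogate pair:

    common = "\u4e2d"       # U+4E2D, an everyday Chinese character in the BMP
    rare = "\U00020bb7"     # U+20BB7, CJK Extension B, seen mostly in old texts
    for ch in (common, rare):
        units = len(ch.encode("utf-16-le")) // 2
        print("U+%04X needs %d UTF-16 code unit(s)" % (ord(ch), units))
    # U+4E2D needs 1 UTF-16 code unit(s)
    # U+20BB7 needs 2 UTF-16 code unit(s)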
Secondly, we often want a string's storage size to map predictably to its character count. For example, take a database column sized for 10 characters (assuming we store the Unicode string in normalized form): in UTF-16 that is 20 bytes, and in practice it really does hold 10 characters, except in extreme cases full of surrogate pairs, where it holds only 5-8. With UTF-8, however, a character takes 1-2 bytes for Western languages and 3-4 bytes for Asian languages, so the same column must reserve anywhere from 10 to 40 bytes even for ordinary text. More data, more processing.
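To put rough numbers on it (a Python sketch with made-up 10-character samples, assuming NFC-normalized BMP text): UTF-16 needs a flat 20 bytes for either string, while UTF-8 varies from 11 to 30 bytes:

    samples = {
        "Western": "Bonjour à!",           # 10 characters, mostly ASCII plus one 2-byte letter
        "Chinese": "这是一个十个字符的串",    # 10 BMP ideographs, 3 bytes each in UTF-8
    }
    for name, text in samples.items():
        assert len(text) == 10             # each sample is exactly 10 characters
        print(name, len(text.encode("utf-8")), "bytes in UTF-8,",
              len(text.encode("utf-16-le")), "bytes in UTF-16")
    # Western 11 bytes in UTF-8, 20 bytes in UTF-16
    # Chinese 30 bytes in UTF-8, 20 bytes in UTF-16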
Dudu