Is there any reason not to use UTF-8, UTF-16, etc. for everything?

I know the web has mostly standardized on UTF-8 lately, and I'm just wondering whether there is any place where using UTF-8 would be a bad idea. I've heard that UTF-8, 16, etc. can use more space, but that in the end it is insignificant.

Also, what about Windows programs, the Linux shell, and things of this nature - can you safely use UTF-8 there?

+7
3 answers

If UTF-32 is available, prefer it over the other forms for processing.

If your platform supports UTF-32 / UCS-4 natively, then the "compressed" forms UTF-8 and UTF-16 may be slower to process, because they use a varying number of bytes per character (code point sequence). That makes direct indexing into a string by character position impossible without scanning, whereas UTF-32 stores every code point in a flat 32-bit unit, which speeds up some string operations considerably.
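As a rough illustration of that indexing difference, here is a minimal Python sketch; the sample string and the helper nth_code_point_utf8 are my own illustrations, not part of any library:

```python
# Indexing into variable-length UTF-8 bytes does not give you the n-th
# character; a fixed-width encoding such as UTF-32 does.
s = "café"                        # 4 code points
utf8 = s.encode("utf-8")          # b'caf\xc3\xa9' -> 5 bytes
utf32 = s.encode("utf-32-le")     # 16 bytes, exactly 4 per code point

print(len(utf8))                  # 5   (the 'é' takes two bytes)
print(utf8[3])                    # 195 (0xC3, a lead byte, not a character)
print(len(utf32) // 4)            # 4   (code point count is just size / 4)

# To reach the n-th code point in UTF-8 you have to scan from the start,
# skipping continuation bytes (those of the form 0b10xxxxxx).
def nth_code_point_utf8(data: bytes, n: int) -> str:
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:      # start of a new code point
            count += 1
            if count == n:
                j = i + 1
                while j < len(data) and data[j] & 0xC0 == 0x80:
                    j += 1
                return data[i:j].decode("utf-8")
    raise IndexError(n)

print(nth_code_point_utf8(utf8, 3))   # 'é', found by an O(n) scan
```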

Of course, if you are programming in a very constrained environment such as an embedded system, and you can be sure that only ASCII or ISO 8859-x characters will ever appear, then you can choose those encodings for efficiency and speed. But in general, stick to the Unicode transformation formats.

+1

When you need to write a program that does heavy string manipulation and has to be very fast, and you are sure you will not need exotic characters, UTF-8 may not be the best idea. In every other situation, UTF-8 should be the standard.

UTF-8 works well with almost all recent software, even on Windows.

0

It is well known that UTF-8 works best for file storage and network transport. But people debate whether UTF-16/32 is better for processing. One of the main arguments against them is that UTF-16 is still variable length, and even UTF-32 is still not one code point per perceived character, so how are they any better than UTF-8? My opinion is that UTF-16 is a very good compromise.
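A small Python sketch of those two caveats, using example characters of my own choosing:

```python
# UTF-16 is still variable length: characters outside the BMP need a
# surrogate pair (two 16-bit code units).
rare_cjk = "\U00020B9F"                        # U+20B9F, a CJK ideograph outside the BMP
print(len(rare_cjk.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
print(len(rare_cjk.encode("utf-32-le")) // 4)  # 1 UTF-32 code unit

# And even UTF-32 is not "one code point per visible character":
# combining marks mean one grapheme can span several code points.
e_acute = "e\u0301"                            # 'e' + COMBINING ACUTE ACCENT
print(e_acute)                                 # renders as a single 'é'
print(len(e_acute))                            # 2 code points, even in UTF-32
```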

First, the characters outside the BMP, which need two code units (a surrogate pair) in UTF-16, are rarely used. The Chinese characters (and some other Asian characters) in that range are mostly historical; ordinary people never use them, except for experts digitizing ancient books. So UTF-32 wastes space most of the time. Don't worry too much about these characters: they will not make your software look bad if you don't handle them perfectly, unless your software is written for those special users.
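If you want to check whether a string actually contains any of those supplementary-plane characters (i.e. whether it would need surrogate pairs in UTF-16), a simple scan is enough; the helper name needs_surrogates below is hypothetical, my own choice:

```python
def needs_surrogates(text: str) -> bool:
    """True if any code point lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_surrogates("北京欢迎你"))      # False: everyday CJK sits inside the BMP
print(needs_surrogates("\U00020B9F"))     # True: CJK Extension B, outside the BMP
```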

Second, we often need string allocation to be related to the number of characters. For example, a database column sized for 10 characters (assuming we store the Unicode string in normalized form) needs 20 bytes with UTF-16. In most cases it works exactly like that; only in rare cases, because of surrogate pairs, will it hold fewer characters (5-8). But with UTF-8, a single character takes 1-3 bytes for Western languages and 3-4 bytes for Asian languages, so we have to reserve 10-40 bytes even for ordinary cases. More data, more processing.
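To make the sizing argument concrete, here is a quick comparison of how many bytes a 10-character value needs in each encoding; the sample strings are my own illustrations:

```python
# Bytes needed to store exactly 10 characters in each encoding.
samples = {
    "English": "abcdefghij",          # 10 ASCII characters
    "French":  "déjà là où",          # 10 characters, four of them accented
    "Chinese": "北京欢迎你北京欢迎你",  # 10 BMP CJK characters
}
for name, text in samples.items():
    assert len(text) == 10
    print(name,
          len(text.encode("utf-8")),      # 10 / 14 / 30 bytes
          len(text.encode("utf-16-le")),  # 20 bytes in every case
          len(text.encode("utf-32-le")))  # 40 bytes in every case
```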

0
