Using UTF in C++ Code

What is the difference between UTF and UCS?

What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:

  • Internal representation inside the code
    • For manipulating a string at run time
    • For using a string for display purposes
  • Best storage representation (i.e. in a file)
  • Best wire transport format (transfer between applications that may be on different architectures and have a different default locale)
+6
c++ unicode utf ucs locale
5 answers

What is the difference between UTF and UCS?

UCS encodings are fixed width and are named by the number of bytes used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range cannot be encoded in a UCS encoding.

UTF encodings are variable width and are named by the minimum number of bits needed to store a character. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using more bytes: 4 bytes for astral characters in UTF-16.
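
To make the "variable width" point concrete, here is a minimal sketch of how one code point turns into UTF-16 code units. The function name `encode_utf16` is made up for illustration; it is not a standard library call.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// BMP code points take one 16-bit unit; astral code points (above U+FFFF)
// take a surrogate pair, i.e. two units or 4 bytes.
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF);              // beyond the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

A UCS-2 encoder would have to stop at the first branch: anything at or above U+10000 simply has no representation there.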

  • Internal representation inside the code
  • Best storage representation (i.e. in a file)
  • Best wire transport format (transfer between applications that may be on different architectures and have a different default locale)

For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate (UTF-7 for old mail servers, UTF-16 for poorly written text editors), but UTF-8 is the most common.
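
As a rough illustration of what goes into a file or onto the wire, this sketch encodes a single code point as UTF-8 bytes. `encode_utf8` is a hypothetical helper, and it assumes the input is a valid code point rather than validating surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes.
// The output is a plain byte sequence, so it has no byte-order variants.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // assumption: caller passes a valid code point
    std::string out;
    if (cp < 0x80) {                       // ASCII stays one byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes (rest of the BMP)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes (astral planes)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```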

The preferred internal representation will depend on your platform. On Windows, it is UTF-16. On UNIX, it is UCS-4. Each has its good points:

  • UTF-16 strings never use more memory than a UCS-4 string. If you store many large strings whose characters lie mainly in the Basic Multilingual Plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
  • UCS-4 is easier to reason about. Because a UTF-16 character may be split across a "surrogate pair", it can be difficult to correctly split or render a string. UCS-4 text does not have this problem. UCS-4 also behaves much like ASCII text in "char" arrays, so existing text algorithms are easy to port.
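
Both points can be seen in a few lines. This is only a sketch, using `std::u16string` for UTF-16 and `std::u32string` for UCS-4; `count_code_points` and `byte_size` are names invented here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count characters (code points) in UTF-16 text.
// A high surrogate followed by a low surrogate is one character, so the
// string has to be scanned; size() alone only gives code units.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool next_low = i + 1 < s.size() &&
                              s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_low) ++i;  // skip the second half of the pair
        ++n;
    }
    return n;
}

// Hypothetical helper: storage used by the code units of any std::basic_string.
template <class String>
std::size_t byte_size(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}
```

For BMP-only text such as u"h\u00E9llo", `byte_size` gives half the figure of the `std::u32string` version, while for a single astral character the two are equal. With `std::u32string`, `size()` already is the character count.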

Finally, some systems use UTF-8 as an internal format. This is good if you need to interoperate with existing ASCII- or ISO-8859-based systems, because NUL bytes never appear in the middle of UTF-8 text, whereas they do in UTF-16 or UCS-4.
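
A quick way to see the NUL-byte difference is to look at the raw bytes. The helper below is an illustrative sketch (the name `contains_zero_byte` is made up); it only wraps `std::memchr`.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: report whether a raw buffer contains a zero byte.
// C-style string functions treat such a byte as the end of the text.
bool contains_zero_byte(const void* data, std::size_t size_in_bytes) {
    return std::memchr(data, 0, size_in_bytes) != nullptr;
}
```

For the UTF-8 bytes of "naïve" (6 bytes) this returns false, so `std::strlen` and friends see the whole string. For the same word in a `std::u16string` it returns true, because every ASCII-range letter carries a zero byte in its 16-bit unit.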

+8

I would suggest:

  • For representation in code, wchar_t or equivalent.
  • For storage representation, UTF-8.
  • For wire representation, UTF-8.

The advantage of UTF-8 in storage and on the wire is that machine endianness is not a factor. The advantage of using a fixed-size character such as wchar_t in code is that you can easily find out the length of a string without having to scan it.
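
To illustrate the length point: with UTF-8 the number of characters has to be counted by walking the bytes. A minimal sketch, assuming the input is well-formed UTF-8 (`utf8_length` is a name made up here):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-8.
// Continuation bytes look like 10xxxxxx; every other byte starts a character.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

With a fixed-width type that holds a whole code point per element, such as `std::u32string`, the same number is just `size()`.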

+2

UTC is Coordinated Universal Time, not a character set (I could not find any UTC encoding).

For internal representation, you can use wchar_t for each character and std::wstring for strings. On Windows they use exactly 2 bytes per character (on Linux and macOS wchar_t is typically 4 bytes), so searching and random access will be fast.
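
The width of wchar_t is implementation-defined, so the byte count is worth checking rather than assuming. A small sketch (both function names are invented for illustration):

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: size of wchar_t on this platform, in bytes.
constexpr std::size_t wchar_t_bytes() { return sizeof(wchar_t); }

// Hypothetical helper: true if one wchar_t can hold any Unicode code point
// (up to U+10FFFF). Where this is false, std::wstring is effectively
// UTF-16 and astral characters need two elements.
constexpr bool wchar_t_holds_any_code_point() {
    return static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFull;
}
```

Code that needs a fixed 16-bit or 32-bit unit regardless of platform could use char16_t or char32_t instead.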

For storage, if most of the data is not ASCII (i.e. code >= 128), you can use UTF-16, which is almost the same as serialized wstring and wchar_t.

Since UTF-16 can be little-endian or big-endian, for wire transport try converting it to UTF-8, which is architecture independent.
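
The byte-order problem shows up as soon as UTF-16 code units are written out as bytes. A sketch, with `ByteOrder` and `utf16_to_bytes` being names invented for this example:

```cpp
#include <string>

enum class ByteOrder { Little, Big };

// Hypothetical helper: serialize UTF-16 code units to bytes in an explicit
// byte order. The same text yields two different byte streams, so sender
// and receiver must agree on the order (or use a byte order mark).
std::string utf16_to_bytes(const std::u16string& s, ByteOrder order) {
    std::string out;
    for (char16_t unit : s) {
        const char low = static_cast<char>(unit & 0xFF);
        const char high = static_cast<char>(unit >> 8);
        if (order == ByteOrder::Little) {
            out += low;
            out += high;
        } else {
            out += high;
            out += low;
        }
    }
    return out;
}
```

For the euro sign U+20AC this gives AC 20 in little-endian order and 20 AC in big-endian order, whereas its UTF-8 form is always E2 82 AC.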

0

For the internal representation inside the code, you had better do this for both European and non-European characters:

\uNNNN

Characters in the range \u0020 to \u007E, and a little whitespace (such as end of line), can be written as ordinary characters. For anything above \u0080, if you write it as an ordinary character, the source will compile correctly only with your code page (for example, OK in France but not in Russia, OK in Russia but not in Japan, OK in China but not in the USA, etc.).
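
In practice that could look like the sketch below, which keeps the source file pure ASCII. The function names are made up, and the u and U literal prefixes (UTF-16 and UTF-32 string literals, C++11 and later) are my choice for the example, not something the answer prescribes.

```cpp
#include <string>

// Hypothetical example: the Russian word "Privet" written with universal
// character names, so the compiler's source code page does not matter.
std::u16string greeting_utf16() {
    return u"\u041F\u0440\u0438\u0432\u0435\u0442";
}

std::u32string greeting_utf32() {
    return U"\u041F\u0440\u0438\u0432\u0435\u0442";
}
```

For code points above U+FFFF the eight-digit form \UNNNNNNNN is needed instead.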

0
