What is the difference between UTF and UCS?
UCS encodings are fixed width, and the name indicates the number of bytes per character: UCS-2 uses exactly 2 bytes per character. Characters with code points outside the representable range cannot be encoded at all.
UTF encodings are variable width, and the name indicates the minimum number of bits needed to hold a character: UTF-16 uses at least 16 bits (2 bytes) per character. Characters with larger code points are encoded with more bytes; astral characters (those outside the Basic Multilingual Plane) take 4 bytes in UTF-16.
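A small sketch of the fixed-width vs. variable-width distinction, using the Python standard library (Python's "utf-32" codec has the same 4-bytes-per-character layout as UCS-4; the character choices below are just illustrative):

```python
bmp_char = "\u00e9"       # é, U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001d11e"  # 𝄞, U+1D11E, an astral (non-BMP) character

# UTF-8: variable width, 1 to 4 bytes per character
print(len(bmp_char.encode("utf-8")))       # 2 bytes
print(len(astral_char.encode("utf-8")))    # 4 bytes

# UTF-16: at least 2 bytes; astral characters need a 4-byte surrogate pair
print(len(bmp_char.encode("utf-16-le")))     # 2 bytes
print(len(astral_char.encode("utf-16-le")))  # 4 bytes

# UTF-32 (same layout as UCS-4): always exactly 4 bytes per character
print(len(bmp_char.encode("utf-32-le")))     # 4 bytes
print(len(astral_char.encode("utf-32-le")))  # 4 bytes
```

UCS-2 has no such escape hatch: a code point above U+FFFF simply does not fit in its fixed 2 bytes.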
Which encoding is best depends on where the text lives:
- Internal representation within your code
- On-disk storage (i.e. in a file)
- Wire transfer between applications, which may run on different architectures and with different default locales
For modern systems, the most reasonable storage and transfer encoding is UTF-8. There are special cases where others may be appropriate (UTF-7 for old mail servers, UTF-16 for poorly written text editors), but UTF-8 is the most common.
The preferred internal representation depends on your platform. On Windows it is UTF-16; on UNIX it is UCS-4. Each has its strengths:
- UTF-16 strings never use more memory than the equivalent UCS-4 string. If you store many large strings whose characters are mostly in the Basic Multilingual Plane (BMP), UTF-16 will require far less space than UCS-4. Outside the BMP, it uses the same amount.
- UCS-4 is easier to reason about. Because characters outside the BMP are encoded in UTF-16 as "surrogate pairs" of two code units, it can be difficult to split or display a UTF-16 string correctly. UCS-4 text has no such problem. UCS-4 is also laid out much like ASCII text in "char" arrays, so existing text algorithms port easily.
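The surrogate-pair hazard in the second point can be demonstrated directly; this sketch (standard library only) splits a UTF-16 byte string at a code-unit boundary that lands inside a surrogate pair:

```python
text = "a\U0001d11eb"  # 'a' + U+1D11E (non-BMP) + 'b': 3 characters

utf32 = text.encode("utf-32-le")
utf16 = text.encode("utf-16-le")

# In UCS-4/UTF-32, code points map one-to-one to 4-byte units:
print(len(utf32) // 4)  # 3 units for 3 characters

# In UTF-16, the astral character occupies two 16-bit code units
# (a surrogate pair), so code-unit count != character count:
print(len(utf16) // 2)  # 4 units for 3 characters

# Naively cutting the UTF-16 data after 2 code units slices the
# surrogate pair in half, leaving invalid text:
broken = utf16[:4]
try:
    broken.decode("utf-16-le")
except UnicodeDecodeError:
    print("split landed inside a surrogate pair")
```

With UCS-4/UTF-32 data, any cut at a 4-byte boundary is a valid character boundary, which is exactly why it is easier to reason about.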
Finally, some systems use UTF-8 as the internal format. This works well if you need to interact with existing ASCII- or ISO-8859-based systems, because NULL bytes never appear in the middle of UTF-8 text, whereas they do in UTF-16 and UCS-4.
John Millikin