What is better for PHP developers - Unicode or UTF-8?

I am going to create an international CMS. Therefore, I will have customers all over the world. They will speak all possible languages.

What encoding format is better for browser recognition and for storing database data?

+6
php encoding unicode utf-8
5 answers

Unicode is not an encoding. You probably mean UTF-8 versus UTF-16 (big-endian or little-endian). For browser support it doesn't matter much: any modern browser supports all three. You will probably find that UTF-8 is the most economical for your database.

+11

UTF-8 is a Unicode encoding: a way of representing an (abstract) sequence of Unicode characters as a (specific) sequence of bytes. There are other encodings, such as UTF-16 (which comes in big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.

UTF-8 is useful if most of your text is in Western languages, since it represents ASCII characters in just one byte, but many other characters take more: a character from a non-Latin script such as Chinese requires three bytes. UTF-16, on the other hand, uses exactly two bytes for all the characters you are likely to encounter (although some very esoteric characters outside of the Unicode "Basic Multilingual Plane" require four).
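The byte counts above are easy to check. Here is a minimal sketch in Python (used only because it is quick to run; the sample characters and the helper name `encoded_sizes` are my own illustrative choices, not part of the answer):

```python
# Sketch: how many bytes one character takes in UTF-8 versus UTF-16.

def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# Illustrative samples: ASCII, accented Latin, Chinese, and a character
# outside the Basic Multilingual Plane.
samples = {
    "ASCII letter": "A",
    "accented Latin letter": "\u00e9",
    "Chinese character": "\u5927",
    "emoji outside the BMP": "\U0001F600",
}

for label, ch in samples.items():
    utf8_size, utf16_size = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
```

For mixed content the totals depend on the proportion of each script, so it is worth measuring on a sample of your real data rather than guessing.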

I would not recommend using PHP to develop international software, because it does not really support Unicode. It has some additional functions for working with Unicode encodings (see the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because this character takes up three bytes in UTF-8 even though it is only one character. Using string-splitting functions such as substr() is risky, because if you split in the middle of a multibyte character, you will corrupt the string.
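To make the byte-versus-character difference concrete, here is a rough sketch in Python rather than PHP. It only models the behaviour described above: `len()` on the encoded bytes plays the role of a byte-oriented length function such as strlen(), and slicing the bytes plays the role of a byte-oriented substr(). The helper name `cut_bytes` is hypothetical.

```python
# Sketch: byte-oriented versus character-oriented handling of "大".

text = "\u5927"              # the character "大"
raw = text.encode("utf-8")   # the bytes a byte-oriented function sees

byte_count = len(raw)        # counts bytes, like a byte-oriented strlen()
char_count = len(text)       # counts characters

def cut_bytes(data, length):
    """Cut a UTF-8 byte string at a byte offset and decode what is left."""
    # errors="replace" turns an incomplete byte sequence into U+FFFD,
    # the Unicode replacement character, instead of raising an exception.
    return data[:length].decode("utf-8", errors="replace")

damaged = cut_bytes(raw, 2)  # cuts through the middle of the character

print(byte_count)            # 3
print(char_count)            # 1
print(damaged == text)       # False: the character did not survive the cut
```

In PHP itself, the multibyte string extension provides character-aware counterparts such as mb_strlen() and mb_substr(), which avoid this problem when given the correct encoding.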

Most of the other languages used for web development, such as Java, C# and Python, have built-in Unicode support, so you can put arbitrary Unicode characters in a string and do not have to worry about what encoding is used to represent them in memory, because from your point of view the string contains characters, not bytes. This is a much safer, less error-prone way to work with Unicode text. For this and other reasons (PHP is actually not such a wonderful language), I would recommend using something else.

(I read that PHP 6 will have proper Unicode support, but this is not yet available.)

+6

UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.

Microsoft recommends:

Developers should use UTF-8 for all Unicode data that they send and receive from the browser.

For database storage, use the encoding your RDBMS handles best. Or, all else being equal, choose based on space efficiency: UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.

+3

Unicode is a standard that defines a set of abstract characters (so-called code points) and their properties (whether a character is a digit, uppercase, etc.). It also defines specific encodings (methods for representing characters as bytes), one of which is UTF-8. See Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for more details.

I would definitely go with UTF-8. It is the standard everywhere these days and has some nice features, such as leaving all 7-bit ASCII characters unchanged, which means that most HTML-related functions like htmlspecialchars can be used directly on UTF-8 strings, so you are less likely to leave encoding-related security holes. In addition, many PHP functions explicitly expect UTF-8 strings, and UTF-8 has better text editor support than alternatives such as UTF-16.
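The ASCII property is worth spelling out: in UTF-8, every byte of a multi-byte character is 0x80 or above, so a byte such as "<" (0x3C) can only ever mean a real "<". UTF-16 gives no such guarantee. A small sketch in Python (the helper name `has_ascii_bytes` and the choice of U+3C3C as a test character are my own, picked only to make the point visible):

```python
# Sketch: ASCII bytes never appear inside a multi-byte UTF-8 character,
# but they can appear inside a UTF-16 code unit.

def has_ascii_bytes(data):
    """True if any byte in data falls in the 7-bit ASCII range (below 0x80)."""
    return any(b < 0x80 for b in data)

# U+3C3C is a CJK ideograph, chosen because its code point is 0x3C3C.
ch = "\u3c3c"
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-le")

print(has_ascii_bytes(utf8_bytes))    # False: no byte can be mistaken for ASCII
print(utf16_bytes == b"<<")           # True: both bytes equal the ASCII "<"

# Plain ASCII markup is byte-for-byte identical in UTF-8.
print("<b>".encode("utf-8") == b"<b>")  # True
```

This is why a byte-oriented escaping function can scan UTF-8 text for "<", ">", "&" and quotes without misreading part of another character.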

+3

It is better to use UTF-8 because it covers the characters and accents of languages all over the world. UTF-8 also has room for characters that are not yet in use or not yet recognized. I prefer UTF-8 and always use it.

0