An explanation of how encoding affects storage / display

Question

I find it infuriating that I don't understand this yet, but maybe some explanation will help. This is a two-part question, but I hope that both parts are small and directly related:

Display

We recently had a problem where content containing the character U+00a0 (non-breaking space) was inserted into a database column encoded as latin1. A plain SELECT displays "Â" in the column. I'm not sure whether this is a product of the select or of the display, but I believe it is the former. SELECT BINARY col prints the space itself instead, as it should, since my shell has $LANG = en_US.utf8.

A more noticeable example is “â„¢” versus “™”.

Using SELECT CONVERT(col USING utf8) still prints "Â" and "â„¢". I would not expect it to be any different, but where does the problem come from? Is it a problem that occurs during storage? Is there a way to get the UTF8 representation from the database instead of relying on the user interface to display it correctly (if that makes sense)?

Storage

In an attempt to reproduce this problem myself, I did the following:

    CREATE TABLE chrs (
        lat varchar(255) charset latin1,
        utf varchar(255) charset utf8
    );
    INSERT INTO chrs VALUES ('™', '™');
    INSERT INTO chrs VALUES (' ', ' '); -- U+00a0

However, this leads to:

    > SELECT * FROM chrs;
    +------+------+
    | lat  | utf  |
    +------+------+
    | ™    | ™    |
    |      |      |
    +------+------+

I would expect lat to display "Â" and "â„¢", so apparently I don't even understand what it is that I don't understand.

What's more, consider this:

    > SELECT BINARY lat, BINARY utf FROM chrs;
    +------------+------------+
    | BINARY lat | BINARY utf |
    +------------+------------+
    |            | ™          |
    |            |            |
    +------------+------------+

This seems to mean that the values are stored incorrectly (?) in lat.

I noticed that SELECT @@character_set_client was utf8, so I changed it to latin1 and inserted the non-breaking space again, but this gives

 | Â | Â |

for both columns. SELECT BINARY lat displays the space correctly, but SELECT BINARY utf still prints "Â". I would have expected the utf8 column to be the one that behaves more correctly.

Summarizing:

What does MySQL actually do with characters when they are inserted? Does it depend on the encoding of the column, the client character set, or both?
Is it possible to corrupt data during insertion because of a mismatch between the above? Or can you always recover the originally inserted data?
What does the charset of a column do with regard to storage and display?

mysql character-encoding

Explosion pills Apr 25 '13 at 17:16

2 answers

minopret · Answer 1 · 2013-04-25T17:47:32+0000

In short, your database seems to be fine, except when you explicitly tell it to behave oddly by changing @@character_set_client from utf8 to latin1. Otherwise, I think you are seeing the effects of a disagreement elsewhere between software components that use UTF-8 and ones that use Windows-1252.

How do we understand what is happening?

First, recall that in MySQL, latin1 really means Windows-1252, an encoding that differs slightly from "Latin-1", which is also known as ISO/IEC 8859-1.

Now consider the following data about the trademark sign and the non-breaking space:

Character: trademark sign
Unicode code point: U+2122
UTF-8 hexadecimal bytes: E2 84 A2
Latin-1 (ISO 8859-1) hexadecimal byte: none, this encoding has no code for the character
Windows-1252 hexadecimal byte: 99

Character: non-breaking space
Unicode code point: U+00A0
UTF-8 hexadecimal bytes: C2 A0
Latin-1 (ISO 8859-1) hexadecimal byte: A0
Windows-1252 hexadecimal byte: A0
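
The byte values for these two characters can be checked with a few lines of Python. This is only a sketch: the helper name hex_bytes is made up for illustration, and Python's "cp1252" codec is assumed to be a fair stand-in for MySQL's latin1 for these two characters.

```python
# Hex bytes of the trademark sign and the non-breaking space in three encodings.
# Assumption: Python's "cp1252" codec stands in for MySQL's latin1 here.

def hex_bytes(ch, encoding):
    """Return the uppercase hex of ch in the given encoding, or None if it cannot be encoded."""
    try:
        return ch.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        return None

TRADEMARK = "\u2122"
NBSP = "\u00a0"

table = {
    name: {enc: hex_bytes(ch, enc) for enc in ("utf-8", "latin-1", "cp1252")}
    for name, ch in (("trademark", TRADEMARK), ("nbsp", NBSP))
}

for name, row in table.items():
    print(name, row)
```

The None entry marks the one combination with no encoding at all: ISO 8859-1 has no byte for the trademark sign.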

Various ways things can go wrong:

Characters resulting from interpreting the trademark sign's UTF-8 hexadecimal bytes as Windows-1252 bytes: â„¢
- "latin small letter a with circumflex", "double low-9 quotation mark", "cent sign"
- Note: Latin-1 and Unicode do not assign a printable character to the 8-bit byte 84, which Windows-1252 defines as "double low-9 quotation mark". Unicode encodes "double low-9 quotation mark" at the distant code point U+201E.
Characters resulting from interpreting the non-breaking space's UTF-8 hexadecimal bytes as Windows-1252 bytes: Â [non-breaking space]
- "latin capital letter a with circumflex", "no-break space"
Characters resulting from interpreting the trademark sign's Windows-1252 hexadecimal byte as UTF-8 bytes: [no character: the platform's missing-character glyph is displayed, usually some variation on a question mark]
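
These three misreadings can be reproduced outside the database. The following Python sketch uses a made-up helper, misread, and again assumes the "cp1252" codec behaves like MySQL's latin1 for the bytes involved; it prints code points rather than glyphs so the result does not depend on the terminal.

```python
# Reproduce the three misinterpretations listed above.
# Assumption: Python's "cp1252" codec matches MySQL's latin1 for these bytes.

def misread(text, written_as, read_as):
    """Encode text with one encoding, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")

tm_utf8_as_cp1252 = misread("\u2122", "utf-8", "cp1252")    # trademark sign
nbsp_utf8_as_cp1252 = misread("\u00a0", "utf-8", "cp1252")  # non-breaking space
tm_cp1252_as_utf8 = misread("\u2122", "cp1252", "utf-8")    # lone byte 99 is not valid UTF-8

for label, result in (
    ("trademark, UTF-8 read as cp1252", tm_utf8_as_cp1252),
    ("nbsp, UTF-8 read as cp1252", nbsp_utf8_as_cp1252),
    ("trademark, cp1252 read as UTF-8", tm_cp1252_as_utf8),
):
    print(label, [f"U+{ord(c):04X}" for c in result])
```

The last case yields U+FFFD, the Unicode replacement character, which is what a terminal typically draws as a question-mark glyph.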

It looks like, on insert, your database stores the trademark sign in "latin1" as the hexadecimal byte 99 and in "UTF-8" as the hexadecimal bytes E2 84 A2. It stores the non-breaking space in "latin1" as the hexadecimal byte A0 and in UTF-8 as the hexadecimal bytes C2 A0. When you do an ordinary SELECT interactively, the "latin1" trademark sign is translated first into Unicode U+2122 and then into the UTF-8 hexadecimal bytes E2 84 A2, which can finally be misinterpreted as if they were Windows-1252 bytes.

Where to find the above character data:

svidgen · Answer 2 · 2013-04-25T18:53:43+0000

If every step of the character's transmission is designated UTF8, the character should be stored as 3 bytes in the UTF8 field, the hex of which is:

E284A2

And, in the latin1 field, as 1 byte, the hex of which is:

99

However, your client and connection play a key role in properly storing the character and displaying it as saved.

Connecting with a latin1 client over a latin1 connection, I created the table and INSERTed both rows. I then changed the client / connection to utf8 and re-inserted them. The result is as follows:

Selecting from my latin1 connection:

    mysql> select *, hex(lat), hex(utf) from chrs;
    +------+------+----------+----------------+
    | lat  | utf  | hex(lat) | hex(utf)       |
    +------+------+----------+----------------+
    | ™    | ™    | E284A2   | C3A2E2809EC2A2 |
    |      |      | 20       | 20             |
    | ?    | ?    | 99       | E284A2         |
    |      |      | 20       | 20             |
    +------+------+----------+----------------+

Selecting from my utf8 connection:

    mysql> select *, hex(lat), hex(utf) from chrs;
    +---------+---------+----------+----------------+
    | lat     | utf     | hex(lat) | hex(utf)       |
    +---------+---------+----------+----------------+
    | â„¢     | â„¢     | E284A2   | C3A2E2809EC2A2 |
    |         |         | 20       | 20             |
    | ™       | ™       | 99       | E284A2         |
    |         |         | 20       | 20             |
    +---------+---------+----------+----------------+
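
The four stored hex values for the trademark rows can be reproduced with a simplified model of an INSERT. This is a sketch, not MySQL's actual code path: the function insert and its parameters are hypothetical, the terminal is assumed to send UTF-8, and Python's "cp1252" codec is assumed to behave like MySQL's latin1.

```python
# Simplified model of an INSERT typed into a UTF-8 terminal.
# Assumptions: the terminal sends UTF-8 bytes; the server decodes them using the
# client character set, then re-encodes the characters into the column charset.

def insert(text, client_charset, column_charset, terminal_charset="utf-8"):
    """Return the uppercase hex of the bytes that would end up stored in the column."""
    wire = text.encode(terminal_charset)      # bytes the terminal actually sends
    understood = wire.decode(client_charset)  # what the server believes was typed
    return understood.encode(column_charset).hex().upper()

TRADEMARK = "\u2122"

stored = {
    ("latin1 client", "lat"): insert(TRADEMARK, "cp1252", "cp1252"),
    ("latin1 client", "utf"): insert(TRADEMARK, "cp1252", "utf-8"),
    ("utf8 client", "lat"): insert(TRADEMARK, "utf-8", "cp1252"),
    ("utf8 client", "utf"): insert(TRADEMARK, "utf-8", "utf-8"),
}

for key, value in stored.items():
    print(key, value)
```

Under these assumptions, the latin1 client produces E284A2 and C3A2E2809EC2A2, while the utf8 client produces 99 and E284A2, which matches the hex columns in both result tables.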

The most confusing behavior here, in my opinion, is that C3A2E2809EC2A2 somehow displays correctly when SELECTed through the latin1 client and connection. But, bearing in mind that this is a UTF8 field, MySQL is no doubt converting each multi-byte sequence into the corresponding latin1 byte for transmission, thus sending E284A2 over the connection. And my terminal just interprets those three bytes as UTF8. (But this is more of a hunch. I'm not quite sure at what point the "inadvertently correct" conversion takes place.)

And, of course, MySQL is kindly handling the latin1 99 in a similar, but opposite, way.
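
The hunch about the "inadvertently correct" display can be checked with a matching model of a SELECT. Again, this is a sketch under stated assumptions rather than MySQL's implementation: the function select and its parameters are hypothetical, the terminal is assumed to decode UTF-8, and "cp1252" stands in for latin1.

```python
# Simplified model of a SELECT shown in a UTF-8 terminal.
# Assumptions: the server decodes the stored bytes using the column charset,
# re-encodes them into the connection's result charset, and the terminal then
# decodes whatever arrives as UTF-8.

def select(stored_hex, column_charset, results_charset, terminal_charset="utf-8"):
    """Return the text a terminal would display for the given stored bytes."""
    text = bytes.fromhex(stored_hex).decode(column_charset)
    wire = text.encode(results_charset, errors="replace")
    return wire.decode(terminal_charset, errors="replace")

# Doubly encoded value in the utf column, read over each connection.
double_via_latin1 = select("C3A2E2809EC2A2", "utf-8", "cp1252")
double_via_utf8 = select("C3A2E2809EC2A2", "utf-8", "utf-8")

# Correctly stored latin1 byte 99, read over each connection.
byte99_via_latin1 = select("99", "cp1252", "cp1252")
byte99_via_utf8 = select("99", "cp1252", "utf-8")

for label, shown in (
    ("C3A2E2809EC2A2 via latin1", double_via_latin1),
    ("C3A2E2809EC2A2 via utf8", double_via_utf8),
    ("99 via latin1", byte99_via_latin1),
    ("99 via utf8", byte99_via_utf8),
):
    print(label, [f"U+{ord(c):04X}" for c in shown])
```

In this model the doubly encoded value comes back as a trademark sign over the latin1 connection because the second conversion undoes the first. That also suggests the original data is recoverable in this case, as long as every intermediate character has a Windows-1252 byte.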
