I find it infuriating that I don't understand this yet, but maybe an explanation will help. This is a two-part question, but I hope both parts are small and directly related:
Display
We recently had a problem where content containing U+00A0 (non-breaking space) characters was inserted into a database column encoded as latin1. A plain SELECT displays "Â" in the column. I'm not sure whether this is a product of the SELECT or of the display, but I believe it is the former. SELECT BINARY col prints " " instead, as it should, since my shell has $LANG=en_US.utf8.
A more notable example is "â„¢" versus "™".
Using SELECT CONVERT(col USING utf8) still prints "Â" and "â„¢". I would not expect it to be any different, but where does the problem come from? Does it occur during storage? Is there a way to get the UTF-8 representation from the database instead of relying on the user interface to display it correctly (if that makes sense)?
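To make the symptom concrete, here is a minimal Python sketch (plain Python, not MySQL) of what I suspect is going on: the UTF-8 bytes of each character are being read back one byte at a time as latin1. It assumes that MySQL's latin1 behaves like Windows-1252 (Python's cp1252 codec) for these bytes, and the helper name misread_utf8_as_latin1 is made up for illustration.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode every byte as cp1252.

    Assumption: cp1252 stands in for MySQL's latin1.
    """
    return text.encode("utf-8").decode("cp1252")


nbsp = "\u00a0"       # NO-BREAK SPACE, UTF-8 bytes C2 A0
trademark = "\u2122"  # TRADE MARK SIGN, UTF-8 bytes E2 84 A2

print(repr(misread_utf8_as_latin1(nbsp)))       # 'Â\xa0'
print(repr(misread_utf8_as_latin1(trademark)))  # 'â„¢'
```

Both results match what I see on screen, which is why I suspect a UTF-8/latin1 mix-up somewhere between the client and the column rather than a rendering bug.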
Storage
In an attempt to reproduce this problem myself, I did the following:
CREATE TABLE chrs (
    lat varchar(255) charset latin1,
    utf varchar(255) charset utf8
);
INSERT INTO chrs VALUES ('™', '™');
INSERT INTO chrs VALUES (' ', ' ');
However, this leads to:
> SELECT * FROM chrs;
+------+------+
| lat  | utf  |
+------+------+
| ™    | ™    |
|      |      |
+------+------+
I would have expected lat to display "Â" and "â„¢", so I don't even know what it is that I don't understand.
What's more, there is this:
> SELECT BINARY lat, BINARY utf FROM chrs;
+------------+------------+
| BINARY lat | BINARY utf |
+------------+------------+
|            | ™          |
|            |            |
+------------+------------+
This suggests that the values are stored incorrectly (?) in lat.
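My guess at why BINARY lat comes out blank for "™": if the column really holds the single-byte latin1 encoding, that byte is not valid UTF-8 on its own, so a UTF-8 terminal has nothing sensible to draw. A minimal Python sketch of that assumption, treating MySQL's latin1 as Windows-1252 (Python's cp1252 codec):

```python
trademark = "\u2122"  # TRADE MARK SIGN

# Assumption: cp1252 stands in for MySQL's latin1.
lat_bytes = trademark.encode("cp1252")  # what a latin1 column would hold
utf_bytes = trademark.encode("utf-8")   # what a utf8 column would hold

print(lat_bytes.hex(), utf_bytes.hex())  # 99 e284a2

# A UTF-8 terminal decoding the lone latin1 byte finds an invalid sequence
# and falls back to the replacement character U+FFFD.
shown = lat_bytes.decode("utf-8", errors="replace")
print(shown == "\ufffd")  # True
```

If that is right, the value in lat is stored correctly, and it is only the raw bytes that my terminal cannot render.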
I noticed that SELECT @@character_set_client returned utf8, so I changed it to latin1 and inserted the space again, but this gives
| Â | Â |
for both columns. SELECT BINARY lat displays the space correctly, but SELECT BINARY utf still prints "Â". I would have expected the utf8 column to be the one that behaves correctly.
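Here is a minimal Python model of what I think happened after the change. It is only a sketch, not MySQL's actual code path, and it rests on three assumptions: my terminal keeps sending UTF-8, character_set_results stayed at utf8 (I only changed character_set_client), and MySQL's latin1 behaves like Python's cp1252 codec. The function names select and select_binary are made up for illustration.

```python
CLIENT = "cp1252"   # character_set_client after my change (latin1)
RESULTS = "utf-8"   # character_set_results, assumed unchanged
TERMINAL = "utf-8"  # my shell, per $LANG

typed = "\u00a0"                # the non-breaking space I typed
wire = typed.encode(TERMINAL)   # bytes the terminal sends: C2 A0
received = wire.decode(CLIENT)  # server reads two latin1 characters

lat_stored = received.encode("cp1252")  # latin1 column holds C2 A0
utf_stored = received.encode("utf-8")   # utf8 column holds C3 82 C2 A0


def select(stored: bytes, column_charset: str) -> str:
    """Plain SELECT: convert from the column charset to RESULTS, then display."""
    return stored.decode(column_charset).encode(RESULTS).decode(TERMINAL)


def select_binary(stored: bytes) -> str:
    """SELECT BINARY: raw stored bytes go straight to the terminal."""
    return stored.decode(TERMINAL, errors="replace")


print(repr(select(lat_stored, "cp1252")))  # 'Â\xa0'
print(repr(select(utf_stored, "utf-8")))   # 'Â\xa0'
print(repr(select_binary(lat_stored)))     # '\xa0'
print(repr(select_binary(utf_stored)))     # 'Â\xa0'
```

The model reproduces all four observations, which would mean the server faithfully stored the two characters it was told it received, and the utf8 column is "wrong" only because the client setting misdescribed my input.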
Summarizing:
- What does MySQL actually do to characters when they are inserted? Does it depend on the column's encoding, the client character set, or both?
- Is it possible to corrupt data during insertion because of a mismatch between the above? Or can the originally inserted data always be recovered?
- What does a charset on a column actually do with regard to storage and display?
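On the second point, my tentative understanding is that this kind of double encoding is reversible as long as every misread byte maps to a real character in the wrong charset. A minimal Python sketch of undoing it, assuming MySQL's latin1 behaves like Python's cp1252 codec (the function name undo_double_encoding is hypothetical):

```python
def undo_double_encoding(garbled: str) -> str:
    """Turn the garbled characters back into the original bytes, then read them as UTF-8.

    Assumption: cp1252 stands in for MySQL's latin1.
    """
    return garbled.encode("cp1252").decode("utf-8")


print(repr(undo_double_encoding("\u00c2\u00a0")))  # '\xa0'
print(undo_double_encoding("\u00e2\u201e\u00a2"))  # ™
```

The sketch raises a UnicodeDecodeError instead of guessing when the input was not double-encoded in the first place. Python's cp1252 codec also leaves a handful of byte values undefined, so it can fail on characters whose UTF-8 encoding contains one of them; I don't know whether MySQL has the same gap.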