Yesterday, I first exported my Mysql database, and I found some very strange characters in the dump, for example:
INSERT INTO `piwik_archive_blob_2013_01` VALUES (15,'Actions_actions_url_6',1,'2013-01-17','2013-01-17',1,'2013-01-20 07:36:53','xuNM0ý/œ#&ÝÕ³\ZõNYpÊÀì#!üw7Hж}°ÀAáZoN*šgµ\'GWª[Yûðe¯57 ÃÁÆ7|Ÿ\'Ü%µDh©-EÛ^ËL±ÕÞtª k@ (,b±ßZ.ÒÃ6b²aiÓÍ)87[ïΜ,æya¥uÒ<|+íª7MNuïÝ¿8ñ%1Ʊ>ÚX');
My MySQL server version: 5.1.66-0 + squeeze1 (Debian). This database was created automatically using the Piwik setup script.
Here is what I tried to solve this problem:
# 1 First, I checked the encoding of the database.
> show table status;
The 26 tables have a utf8_general_ci collation, which sounded quite normal. I assumed that mysqldump was exported to a different encoding (latin1?), So I tried:
mysqldump -u user -p**** --all-databases --default-character-set=utf8 | gzip -9 > dump.sql.gz
Result = I still had the same weird characters.
Note) I later learned that the default character set for mysqldump is utf8, regardless of the default server character set. So --default-character-set=utf8 was useless.
# 2 Then I thought I could solve the problem by updating mysql conf. The initial conf was:
mysql> show variables like "%character%";show variables like "%collation%"; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ +----------------------+-------------------+ | Variable_name | Value | +----------------------+-------------------+ | collation_connection | latin1_swedish_ci | | collation_database | latin1_swedish_ci | | collation_server | latin1_swedish_ci | +----------------------+-------------------+
So, I updated /var/lib/mysql/my.cnf and I added:
[mysqld] init_connect='SET collation_connection = utf8_general_ci' init_connect='SET NAMES utf8' character-set-server=utf8 collation-server=utf8_general_ci default-character-set=utf8 default-collation=utf8_general_ci [mysqldump] default-character-set=utf8
Then
/etc/init.d/mysql restart mysql> show variables like "%character%";show variables like "%collation%"; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ +----------------------+-------------------+ | Variable_name | Value | +----------------------+-------------------+ | collation_connection | utf8_general_ci | | collation_database | latin1_swedish_ci | | collation_server | utf8_general_ci | +----------------------+-------------------+
Result = same weird characters.
# 3 I changed character_set_database and collation_database :
mysql> ALTER DATABASE piwik default character SET utf8 collate utf8_general_ci; mysql> show variables like "%character%";show variables like "%collation%"; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ +----------------------+-----------------+ | Variable_name | Value | +----------------------+-----------------+ | collation_connection | utf8_general_ci | | collation_database | utf8_general_ci | | collation_server | utf8_general_ci | +----------------------+-----------------+
Result = same weird characters.
# 4 Now I realized that I had to change the default character set in MySQL (latin1) to utf8 before , creating the database.
Matching utf8_general_ci (# 1) means that the data is stored in utf8. However, is it possible that mysqldump believes that the data is stored in latin1 and encodes the data in utf8? This would mean that in the end the data is double utf8 encoded (sigh). In this case, how can I fix the problem?
Thank you for your help.
ps) I wonder why Piwik does not require changing the default encoding of the database in utf8.