What happens to existing data if I change the column collation in MySQL?

I am running a production application with a MySQL database server. I forgot to change the column collation from latin to utf8_unicode, which results in strange data when multilingual text is stored in a column.

My question is: what will happen to my existing data if I change the collation to utf8_unicode now? Will it destroy or corrupt the existing data, or will the data remain and only new data be saved as utf8, as it should be?

I will make the change with the phpMyAdmin web client.

+7
5 answers

I ran a quick test in MySQL 5.1 using a VARCHAR column set to latin1_bin. I inserted some non-Latin characters:

 INSERT INTO Test VALUES ('英國華僑');

I selected them back and got garbage (as expected):

 SELECT text from Test; 

gives

 text
 ????

Then I changed the column collation to utf8_unicode, re-ran the SELECT, and it showed the same result:

 text
 ????

This is what I would expect: the data is kept, but it remains garbage. When the data was inserted, the column lost the extra information about the characters and simply stored a ? for every non-Latin character, so there is no way for ???? to become 英國華僑 again.

Your data will remain in place, but it will not be corrected.

+4

The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this in detail, and also shows what will happen.

Note that you are mixing up CHARACTER SET (actually an encoding) with COLLATION.

A character set defines the physical representation of a string in bytes on disk. You can make this visible with the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1, to see how MySQL stores the bytes of your string. What MySQL delivers to you may differ, depending on the character set of your connection, which is defined with SET NAMES ....
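As a rough sketch of that check, the statements below reuse the t, str and id names from the example above (placeholders, not a real schema). They show the stored bytes next to the character and byte counts, then list the character sets declared for the columns and used by the current connection.

```sql
-- Stored bytes of one value, with character count vs. byte count.
SELECT str,
       HEX(str)         AS stored_bytes,
       CHAR_LENGTH(str) AS n_chars,
       LENGTH(str)      AS n_bytes
FROM t
WHERE id = 1;

-- Character set and collation declared for each column of the table.
SHOW FULL COLUMNS FROM t;

-- Character sets used by the current connection and the server.
SHOW VARIABLES LIKE 'character_set_%';
```

If the HEX output shows 3F where a non-Latin character should be, that character was already replaced by ? when it was stored.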

A collation is a sort order. It depends on the character set. For example, your data may be in the latin1 character set, but it can be ordered according to either of the two German collations, latin1_german1_ci or latin1_german2_ci. Depending on your choice, umlauts like ö will sort either as oe or as o.
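To see the difference between the two German collations without touching a table, a comparison along these lines should work. It is only a sketch, and it assumes the client and the connection agree on the character set used to send the literals; they are converted to latin1 explicitly so that the named collations apply.

```sql
-- Dictionary order (latin1_german1_ci): ö is compared as o.
SELECT CONVERT('ö' USING latin1) =
       CONVERT('o' USING latin1) COLLATE latin1_german1_ci AS o_umlaut_equals_o;

-- Phone-book order (latin1_german2_ci): ö is compared as oe.
SELECT CONVERT('ö' USING latin1) =
       CONVERT('oe' USING latin1) COLLATE latin1_german2_ci AS o_umlaut_equals_oe;
```

The stored bytes are identical in both cases; only the comparison rules differ.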

When you change the character set, the data in the table has to be rewritten. MySQL reads all the data and all the indexes in the table, builds a hidden copy of the table, which temporarily takes up disk space, then moves the old table aside, moves the hidden copy into place, and finally drops the old data, freeing the disk space. For some time in between you will need twice the storage.

When you change the collation, the sort order of the data changes, but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the .frm file, and sufficiently recent versions of MySQL should not do more than that.

When you change the collation of a column that is part of an index, the index has to be rewritten, since an index is a sorted excerpt of the table. This again triggers the ALTER TABLE table copy logic described above.

MySQL tries to preserve the data while doing this: as long as the data you have can be represented in the target character set, the conversion is lossless. Warnings are printed if data truncation occurs, and data that cannot be represented in the target character set is replaced by ?.
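A minimal sketch of the two kinds of change, assuming a hypothetical column str VARCHAR(255) in a table t that currently uses latin1. MODIFY replaces the whole column definition, so any NULL or DEFAULT attributes the real column has would need to be repeated.

```sql
-- Collation change only: the character set stays latin1, only the ordering rules change.
ALTER TABLE t
  MODIFY str VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_german2_ci;

-- Character set change: the stored bytes are converted from latin1 to utf8.
ALTER TABLE t
  MODIFY str VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- Lists truncation or conversion warnings raised by the previous statement.
SHOW WARNINGS;
```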

+6

Valid data will be correctly converted:

When you change a data type using CHANGE or MODIFY, MySQL tries to convert existing column values to the new type as well as possible. Warning: This conversion may result in alteration of data.

http://dev.mysql.com/doc/refman/5.5/en/alter-table.html

... and more specifically:

To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For a successful conversion, one of the following conditions must apply: [...] If you have a column that has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.

http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html
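The two-step conversion described in the quoted passage might look like the sketch below. The table t, the column str and the length 255 are placeholders, and it assumes the column is declared latin1 while the bytes in it are really UTF-8. Going through a binary type makes MySQL relabel the bytes instead of converting them.

```sql
-- Step 1: drop the character set label; the stored bytes are kept untouched.
ALTER TABLE t MODIFY str VARBINARY(255);

-- Step 2: declare the same bytes as utf8; no conversion is applied to them.
ALTER TABLE t MODIFY str VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```

If the column really does contain latin1 data, this procedure would mislabel it, so it is worth inspecting HEX output before running it.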

So your problem is invalid data, that is, data encoded in a different character set. I tried the hint suggested in the documentation and it basically ruined my data, but the reason is that my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte characters had been inserted as 0x3F (i.e. the ? character in Latin1). My MySQL stack was smart enough to detect that the input was not Latin1 and converted it to something "compatible". And once the data is gone, you cannot get it back.

Summarizing:

  • Use HEX() to find out whether all your data is still there.
  • Run your tests on a copy of the table.
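
One way to follow both points, sketched with placeholder names (t, t_copy, id, str): clone the table, then look for values that already contain a literal ?. Legitimate question marks match too, so the HEX column is there for checking by eye.

```sql
-- Clone the structure (column character sets and indexes included), then the rows.
CREATE TABLE t_copy LIKE t;
INSERT INTO t_copy SELECT * FROM t;

-- Rows containing a literal question mark (byte 3F).
SELECT id, str, HEX(str) AS stored_bytes
FROM t_copy
WHERE INSTR(str, '?') > 0;
```

Any ALTER TABLE experiment can then be run against t_copy first and the HEX output compared before and after.
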
+1

My question is: what happens to my existing data if I change my collation to utf8_unicode now?

Answer. If you switch to utf8_unicode_ci, nothing will happen to your existing data (which is already corrupted and will stay corrupted until you fix it).

Will it destroy or corrupt existing data, or will the data remain and only new data be saved as utf8, as it should be?

Answer. After you switch to utf8_unicode_ci, existing data will not be destroyed. It will remain the same as before (something like ????). However, if you insert new data containing Unicode characters, it will be saved correctly.

I will make the change with the phpMyAdmin web client.

Answer. Yes, you can change the collation in phpMyAdmin by going to "Operations" > "Table Options".

0

ATTENTION! Some problems are solved with

 ALTER TABLE ... CONVERT TO ... 

Others are solved using a two-step process:

 ALTER TABLE ... MODIFY ... VARBINARY...
 ALTER TABLE ... MODIFY ... VARCHAR...

If you do the wrong thing, you will have an even bigger mess!
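
As a rough guide to choosing between the two, with t and str as placeholder names: CONVERT TO re-encodes the stored bytes, so it fits columns whose contents really are in the declared character set. When the bytes are already in the target encoding and only the column declaration is wrong, the two-step route through VARBINARY is the one that avoids a second round of conversion.

```sql
-- Check what is actually stored before picking a method.
SELECT str, HEX(str) FROM t LIMIT 10;

-- Correctly labeled data: convert every character column of the table in one statement.
ALTER TABLE t CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
```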

0