Clarification of potential equality issues for accented characters with non-binary collations

Question

For a website with international support, I use the utf8mb4 character set and the utf8mb4_unicode_ci collation for most tables and columns. Performance is not paramount, and accurate sorting across languages is important.

I understand how the utf8mb4_general_ci and utf8mb4_unicode_ci collations handle comparisons involving accented characters in general, namely:

SELECT column FROM table WHERE column='abad';

would return both "abad" and "abád".
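
As a quick check of this behaviour, the comparison can be run directly on string literals. This is a minimal sketch that assumes the client really sends UTF-8 and that the accented character is the precomposed "á"; the column aliases are illustrative only.

```sql
-- Assumption: the client terminal sends UTF-8 bytes.
SET NAMES utf8mb4;

-- COLLATE binds to the right-hand literal and decides how '=' compares.
SELECT 'abad' = 'abád' COLLATE utf8mb4_unicode_ci AS unicode_ci_equal,
       'abad' = 'abád' COLLATE utf8mb4_general_ci AS general_ci_equal,
       'abad' = 'abád' COLLATE utf8mb4_bin        AS bin_equal;
```

Under those assumptions the first two columns should come back as 1 and the last as 0.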

While studying utf8 support in MySQL, I ran into an alleged problem with the non-binary utf8_* collations. The page http://mzsanford.com/blog/mysql-and-unicode/ describes a problem where changes are not saved in some updates. The author says: "When updating a record, it seems that MySQL (or at least InnoDB) checks for equality before updating the record. Since an accent-only change is considered equal, MySQL skips the write (which saves I/O overhead) and returns success, since it believes it has optimized the write, not failed."

I interpret this as: if you try to update a record in a way that changes only the accents in a field, the update will not be applied (since MySQL considers the old and new values equal). But I could not reproduce it. I created a simple test case:

 CREATE DATABASE test_utf8 CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
 USE test_utf8;
 CREATE TABLE test (
   id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
   text VARCHAR(300) NOT NULL,
   PRIMARY KEY (id)
 ) ENGINE = INNODB;
 INSERT INTO test (text) VALUES ('abád');
 UPDATE test SET text='abad' WHERE id=1;

However, this correctly updates the value (despite only changing the accent on one character). Perhaps this was only a problem in an older version of MySQL? Or does the issue arise under slightly different circumstances?
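
One way to double-check a test like this is to look at the stored bytes rather than the displayed text, since an accent-insensitive comparison cannot tell the two values apart. The sketch below assumes the test table from the setup above and a connection that sends UTF-8; the aliases are illustrative.

```sql
-- Assumption: the `test` table from the setup above, UTF-8 client.
UPDATE test SET `text` = 'abad' WHERE id = 1;

-- ROW_COUNT() must be read right after the UPDATE; by default it reports
-- the number of rows whose stored value actually changed.
SELECT ROW_COUNT() AS rows_changed;

-- HEX() exposes the stored bytes independently of any collation:
-- 'abad' is 61626164 and 'abád' is 6162C3A164 in UTF-8.
SELECT `text`, HEX(`text`) AS stored_bytes FROM test WHERE id = 1;
```

If the write had been skipped, rows_changed would be 0 and stored_bytes would still contain C3A1.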

I would also appreciate it if you had a moment to read some of my notes on several concepts around the subject and tell me whether I have any misconceptions. If they are all correct, perhaps they will be useful information for someone.

The MySQL utf8 character set is not true UTF-8, because it only stores characters of 1-3 bytes. For true UTF-8 support, you probably want to use utf8mb4.
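
The 3-byte limit can be illustrated with a character outside the Basic Multilingual Plane. This is a hedged sketch: charset_demo and its column names are hypothetical, and the hex literal is used so that the result does not depend on the client's encoding.

```sql
-- U+1F600 is encoded in UTF-8 as the four bytes F0 9F 98 80.
SELECT CHAR_LENGTH(_utf8mb4 x'F09F9880') AS characters,  -- 1 character
       LENGTH(_utf8mb4 x'F09F9880')      AS bytes;       -- 4 bytes

-- Hypothetical table with one column per character set.
CREATE TABLE charset_demo (
  three_byte VARCHAR(10) CHARACTER SET utf8,
  four_byte  VARCHAR(10) CHARACTER SET utf8mb4
);

-- The utf8mb4 column can store the character as-is.
INSERT INTO charset_demo (four_byte) VALUES (_utf8mb4 x'F09F9880');

-- The utf8 column cannot represent it; depending on the SQL mode this is
-- expected to be rejected or to store something other than the original.
INSERT INTO charset_demo (three_byte) VALUES (_utf8mb4 x'F09F9880');
```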

In general, utf8mb4_unicode_ci sorts more accurately across languages, but it carries a slight performance cost compared with utf8mb4_general_ci.

If a column never needs to be sorted and is only used in comparisons / equality checks, you can use utf8mb4_bin, since it will be a little faster.

Accented characters are considered equal to their unaccented counterparts in both the utf8mb4_general_ci and utf8mb4_unicode_ci collations. Because of this, these are poor collation choices for columns that must hold unique values (e.g. primary keys). In that case, use utf8mb4_bin. If a field must be accent-sensitive for uniqueness but occasionally needs language-aware sorting, it can be stored as utf8mb4_bin and the COLLATE clause used in the query when ordering. Example:

 SELECT column FROM table ORDER BY column COLLATE utf8mb4_unicode_ci;

This orders the results with language-aware rules even though the column is stored with a binary collation. It does affect performance, because the column's collation determines how it is indexed. The difference in query performance will be similar to the difference between sorting an unindexed column and sorting an indexed one.
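
The uniqueness point and the ORDER BY ... COLLATE approach can be sketched together. The tables tag_ci and tag_bin are hypothetical, the statements are meant to be run interactively (one INSERT is expected to fail), and a UTF-8 client is assumed.

```sql
-- Assumption: UTF-8 client; tag_ci and tag_bin are hypothetical tables.
SET NAMES utf8mb4;

CREATE TABLE tag_ci (
  name VARCHAR(50) NOT NULL,
  UNIQUE KEY uq_name (name)
) ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

INSERT INTO tag_ci (name) VALUES ('abad');
-- Expected to fail with a duplicate-key error (1062), because the unique
-- index compares values with the accent-insensitive column collation.
INSERT INTO tag_ci (name) VALUES ('abád');

CREATE TABLE tag_bin (
  name VARCHAR(50) NOT NULL,
  UNIQUE KEY uq_name (name)
) ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

-- Both rows are accepted: the binary collation distinguishes them.
INSERT INTO tag_bin (name) VALUES ('abad'), ('abád');

-- Language-aware ordering on top of binary storage.
SELECT name FROM tag_bin ORDER BY name COLLATE utf8mb4_unicode_ci;

-- EXPLAIN can be used to see whether the index still serves the ORDER BY.
EXPLAIN SELECT name FROM tag_bin ORDER BY name COLLATE utf8mb4_unicode_ci;
```

Comparing the EXPLAIN output with and without the COLLATE clause is a simple way to observe the indexing cost described above on a given server version.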

By default, searches under utf8mb4_unicode_ci or utf8mb4_general_ci are not accent-sensitive, so a search for "abad" will return both "abad" and "abád". Therefore, if you want an accent-sensitive search, you need to either set the column's collation to utf8mb4_bin (if all queries should be accent-sensitive) or use the COLLATE clause in the query (if most queries should stay accent-insensitive). Since the utf8mb4_bin collation is also case-sensitive, you will need to modify the query further if you want a case-insensitive but accent-sensitive search. For example (assuming your search term has already been lowercased in the server-side scripting language):

 -- Assuming the data is stored with a collation of utf8mb4_bin
 SELECT column FROM table WHERE LOWER(column) LIKE 'abád';

 -- Assuming the data is stored with a collation of utf8mb4_unicode_ci
 SELECT column FROM table WHERE LOWER(column) LIKE 'abád' COLLATE utf8mb4_bin;
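
The effect of lowercasing first and then comparing with a binary collation can be checked on literals, using MySQL's LOWER() function. This is a minimal sketch assuming a UTF-8 client; the aliases are illustrative.

```sql
-- Assumption: the client terminal sends UTF-8 bytes.
SET NAMES utf8mb4;

SELECT LOWER('ABÁD') = 'abád' COLLATE utf8mb4_bin AS same_accent_other_case,
       LOWER('ABAD') = 'abád' COLLATE utf8mb4_bin AS other_accent;
```

With those assumptions, same_accent_other_case should be 1 (case is folded away before the comparison) and other_accent should be 0 (the accent still matters).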

Also, from the MySQL documentation (included here for the benefit of others): when comparing values from different columns, declare those columns with the same character set and collation wherever possible, to avoid string conversions while running the query.

mysql collation non-ascii-characters

dnag Feb 20 '14 at 21:05

1 answer

Christopher mcgowan · Answer 1 · 2015-04-14T05:14:04+0000

I am not an expert, but I tried what you did, with a few additions...

I ran your setup and the following in MySQL 5.6.17:

 SELECT COUNT(*) FROM test WHERE `text`='abad';
 SELECT COUNT(*) FROM test WHERE `text`='abád';
 UPDATE test SET text='abád' WHERE id=1;

Both SELECTs return 1 row, as expected, and the update (like your update) changes 1 row, which does not match what the blog describes.

I thought it might be a lower level optimization, but I noticed something interesting when I tried to run it again in the command line client (instead of Workbench):

 mysql> SELECT COUNT(*) FROM test WHERE `text`='abád';
 ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
 mysql> UPDATE test SET text='abád' WHERE id=1;
 ERROR 1366 (HY000): Incorrect string value: '\xA0d' for column 'text' at row 1

So, I ran this to see what happens:

 mysql> SELECT collation('abád');
 +-------------------+
 | collation('abád') |
 +-------------------+
 | utf8_general_ci   |
 +-------------------+
 1 row in set (0.00 sec)

There must be some kind of coercion going on because of my session settings... so I tried specifying the collation explicitly:

 UPDATE test SET text='abad' COLLATE utf8_unicode_ci WHERE id=1;
 UPDATE test SET text='abád' COLLATE utf8_unicode_ci WHERE id=1;

And yet I got the same results (updated both times).
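
The errors from the command line client above are consistent with a session whose connection character set is utf8 rather than utf8mb4, and the '\xA0d' in the second error suggests the console may have been sending bytes in a non-UTF-8 code page. A hedged sketch of how the session settings could be inspected and aligned with the table follows; it assumes the terminal itself is switched to UTF-8, which SET NAMES does not do.

```sql
-- Show what the current session is using.
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_connection';

-- Declare that the client sends and expects utf8mb4, and pick the
-- collation used for string literals in this session.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- A plain literal now takes the connection collation.
SELECT COLLATION('abad') AS literal_collation;  -- utf8mb4_unicode_ci
```

Once the literal and the column share a character set and collation, the illegal-mix error should no longer come from a utf8 versus utf8mb4 mismatch.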

For now I am left with my hunch that the InnoDB optimization happens at a lower level than a SELECT with text criteria.
