Ruby, problems matching strings with UTF-8 characters

Question

I have these two UTF-8 strings:

 a = "N\u01b0\u0303"
 b = "N\u1eef"

They look completely different in source form, but they are identical once rendered:

 irb(main):039:0> puts "#{a} - #{b}"
 Nữ - Nữ

Version a is the one I saved in the database. Version b is the one that comes from the browser in the POST request. I don't know why the browser sends a different combination of UTF-8 characters, and it does not always happen: I cannot reproduce the problem in my dev environment; it happens only in production and only in a percentage of the total number of requests.

The point is that I'm trying to compare the two strings, but the comparison returns false:

 irb(main):035:0> a == b
 => false

I tried different things, like forcing the encoding:

 irb(main):022:0> c.force_encoding("UTF-8") == a.force_encoding("UTF-8")
 => false

Another interesting fact:

 irb(main):005:0> a.chars
 => ["N", "ư", "̃"]
 irb(main):006:0> b.chars
 => ["N", "ữ"]

How can I compare these strings?

+6

ruby unicode ruby-on-rails-3 utf-8 character-encoding

fguillen Nov 24 '15 at 14:52

2 answers

You can see that these are different characters. The first string uses ư (U+01B0) followed by the "combining tilde" modifier (U+0303); the second uses the single character ữ (U+1EEF).

Wikipedia has a section on this subject:

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

and

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.
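The "n" plus combining tilde example from the quote can be checked directly in Ruby. This is only a minimal sketch, assuming Ruby 2.2 or later (where String#unicode_normalize is available); the variable names are illustrative.

```ruby
decomposed  = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"    # U+00F1, the single code point for ñ

# Plain comparison looks at the code points, so the two strings differ.
puts decomposed == precomposed                    # false

# After normalization (NFC by default) both are the same code point sequence.
puts decomposed.unicode_normalize == precomposed  # true
```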

Ruby seems to support this normalization, but only as of Ruby 2.2:

http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html

 a = "N\u01b0\u0303".unicode_normalize
 b = "N\u1eef".unicode_normalize
 a == b # true

Alternatively, if you use Ruby on Rails, there is a built-in method for normalization.

+3

Martin Konecny Nov 24 '15 at 15:44

matt · Accepted Answer · 2015-11-24T15:41:19+0000

This is a problem of Unicode equivalence.

Your string a consists of the character ư (U+01B0, LATIN SMALL LETTER U WITH HORN) followed by U+0303 COMBINING TILDE. This second character, as the name implies, is a combining character: when rendered, it is combined with the previous character to create the final glyph.

String b uses the character ữ (U+1EEF, LATIN SMALL LETTER U WITH HORN AND TILDE), which is a single character and is equivalent to the previous combination, but uses a different byte sequence to represent it.

To compare these strings, you need to normalize them so that they use the same byte sequences for these kinds of characters. Current versions of Ruby have this built in (in earlier versions you had to use a third-party library).
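To make the difference visible, the code points and byte sizes of both strings can be printed. This is a small sketch; `codepoints_hex` is a hypothetical helper written for this illustration, not something Ruby provides.

```ruby
a = "N\u01b0\u0303"   # decomposed: u-with-horn followed by a combining tilde
b = "N\u1eef"         # precomposed: a single u-with-horn-and-tilde character

# Hypothetical helper (not part of Ruby): code points in U+XXXX notation.
def codepoints_hex(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

p codepoints_hex(a)   # ["U+004E", "U+01B0", "U+0303"]
p codepoints_hex(b)   # ["U+004E", "U+1EEF"]

# The UTF-8 byte sequences differ as well, which is why == returns false.
p a.bytesize          # 5
p b.bytesize          # 4
```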

So, now you have

 a == b

which is false but if you do

 a.unicode_normalize == b.unicode_normalize

you should get true .
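If this comparison is needed in more than one place, it could be wrapped in a small helper. The sketch below assumes Ruby 2.2 or later and UTF-8 encoded input; `same_text?` and the variable names are hypothetical, chosen only for this example.

```ruby
# Hypothetical helper, not part of Ruby or Rails: true when both strings
# are canonically equivalent, i.e. identical after NFC normalization.
def same_text?(left, right)
  left.unicode_normalize(:nfc) == right.unicode_normalize(:nfc)
end

stored    = "N\u01b0\u0303"  # decomposed form, as in string a
submitted = "N\u1eef"        # precomposed form, as in string b

puts stored == submitted            # false
puts same_text?(stored, submitted)  # true
```

Another option, under the same assumptions, is to normalize every incoming value once before saving it, so that the stored data always uses a single form and a plain == is enough afterwards.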

If you are using an older version of Ruby, there are several options. Rails has a normalize method as part of its multi-byte support, so if you use Rails you can do:

 a.mb_chars.normalize == b.mb_chars.normalize

or maybe something like:

 ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)

If you are not using Rails, you can look at the unicode_utils gem and do something like this:

 UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)

(nfkc refers to the normalization form; it is the default in the Rails methods above, whereas String#unicode_normalize defaults to nfc. Either form makes these two strings compare equal.)

There are various ways to normalize Unicode strings (i.e. whether you use the decomposed or composed versions), and the examples above just use each method's default. I am leaving the study of the differences to you.
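As a starting point for that study, the four standard forms can be tried side by side. This is a rough sketch assuming Ruby 2.2 or later; the ligature string is an extra illustration that does not come from the question.

```ruby
a = "N\u01b0\u0303"
b = "N\u1eef"

# Canonically equivalent strings compare equal under every form;
# the forms differ only in which representation they settle on.
%i[nfc nfd nfkc nfkd].each do |form|
  na = a.unicode_normalize(form)
  nb = b.unicode_normalize(form)
  puts "#{form}: equal=#{na == nb}, code points=#{na.codepoints.size}"
end

# The "k" (compatibility) forms additionally fold compatibility characters,
# for example the single-character ligature U+FB01 becomes the letters "fi".
ligature = "\ufb01"
puts ligature.unicode_normalize(:nfc) == "fi"    # false
puts ligature.unicode_normalize(:nfkc) == "fi"   # true
```

Which form to pick depends on the application: the composed form nfc keeps text compact and matches what string b already contains, while the compatibility forms change more of the text and are usually reserved for searching or matching.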
