Two apparently identical Python Unicode UTF8 encoded strings do not match

Question

Two apparently identical Python strings, one Unicode and one UTF-8 encoded, do not compare equal:

>>> str1 = unicode('María','utf8')
>>> str2 = u'María'.encode('utf8')
>>> str1 == str2
False

How is this possible?

In case it matters, I am using the IPython Notebook.

+1

python unicode utf-8

Eduardo martin Jun 27 '13 at 12:33


2 answers

A string cannot be both "Unicode" and "UTF-8 encoded"; the two are mutually exclusive. Consequently, these are different strings.

+3

Ignacio Vazquez-Abrams Jun 27 '13 at 12:36


Martijn Pieters · Accepted Answer · 2013-06-27T12:36:00+0000

You have a unicode string and a byte string. These are not the same thing.

One is a Unicode value, u'María'. The other contains the UTF-8 encoded bytes, 'Mar\xc3\xada'.
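As a rough illustration of the difference, the sketch below builds both values and compares their lengths. The names text and data are illustrative and not from the question, and escape sequences are used so the result does not depend on the source file encoding. It should run unchanged on Python 2.7 and Python 3.

```python
# Illustrative names only; escapes avoid any dependence on source encoding.
text = u'Mar\xeda'            # Unicode text: 5 code points (\xed is i-acute)
data = text.encode('utf-8')   # UTF-8 bytes: i-acute becomes the two bytes \xc3\xad

print(len(text))                # number of code points
print(len(data))                # number of bytes
print(data == b'Mar\xc3\xada')  # the encoded form matches the bytes shown above
```

The byte string is one element longer than the text because the single character í is encoded as two bytes.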

Python 2 performs an implicit conversion when comparing unicode and byte string values, but you should not rely on it: the conversion depends entirely on the default codec configured for your system.
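One way to avoid depending on the implicit conversion is to convert explicitly so that both sides have the same type before comparing. The following is a minimal sketch, assuming the bytes are known to be UTF-8; the variable names are hypothetical. It behaves the same on Python 2.7 and Python 3.

```python
# Hypothetical values mirroring the question; assumes the bytes are UTF-8.
text = u'Mar\xeda'        # Unicode text
data = b'Mar\xc3\xada'    # UTF-8 encoded bytes

# Option 1: decode the bytes, then compare text with text.
same_as_text = data.decode('utf-8') == text

# Option 2: encode the text, then compare bytes with bytes.
same_as_bytes = text.encode('utf-8') == data

print(same_as_text)
print(same_as_bytes)
```

Decoding at the input boundary and working with Unicode text internally is usually the simpler choice, since it keeps the codec explicit in one place.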

If you are unsure what Unicode is, why UTF-8 is not the same thing, or want to learn more about encodings, see:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
