How can I check a Python unicode string to make sure that it *is actually* proper Unicode?

Question


So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Something in it seems to be messed up: it decodes correctly, but when I try to save it in Postgres, I get:

DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf

The database clams up after that and refuses to do anything without a rollback, which would be a bit complicated to issue (long story). Is there a way to check whether this will happen before it reaches the database? source.encode("utf-8") works without raising an error, so I'm not sure what is happening ...

+7

python postgresql unicode

Stavros korokithakis Aug 15 '10 at 12:38


5 answers

A Python unicode object is a sequence of Unicode code points and is, by definition, proper Unicode. A Python str string is a sequence of bytes, which might be Unicode characters encoded with a particular encoding (UTF-8, Latin-1, Big5, ...).

The first question is whether source is a unicode object or a str string. The fact that source.encode("utf-8") works only means that you can convert source to a UTF-8 encoded string, but are you doing that before passing it to the database functions? It seems that the database expects its input to be encoded as UTF-8 and is complaining about the equivalent of source.decode("utf-8").

If source is a unicode object, it must be encoded in UTF-8 before passing it to the database:

 source = u'abc'
 call_db(source.encode('utf-8'))

If source is a str encoded as something other than UTF-8, you must decode it from that encoding and then encode the resulting unicode object to UTF-8:

 source = 'abc'
 call_db(source.decode('Big5').encode('utf-8'))

+1

sth Aug 15 '10 at 12:58


What exactly are you doing? The content does decode perfectly well as UTF-8:

 >>> import urllib
 >>> webcontent = urllib.urlopen("http://hub.iis.sinica.edu.tw/cytoHubba/").read()
 >>> unicodecontent = webcontent.decode("utf-8")
 >>> type(webcontent)
 <type 'str'>
 >>> type(unicodecontent)
 <type 'unicode'>
 >>> type(unicodecontent.encode("utf-8"))
 <type 'str'>

Make sure you understand the difference between unicode strings and UTF-8 encoded strings. What you need to send to the database is unicodecontent.encode("utf-8") (this is the same as webcontent, but you have decoded it first to make sure that there are no invalid byte sequences in your source).

I would second what WoLpH says: check the settings of the database and of the database connection.
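As a rough sketch of that decode-first check, assuming Python 3 (where bytes and str are separate types), a strict decode can act as a validity test. The helper name is_valid_utf8 is made up for illustration and does not come from any library.

```python
def is_valid_utf8(data):
    """Return True if the bytes object decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


good = "caf\u00e9".encode("utf-8")  # well-formed UTF-8
bad = b"\xed\xbd\xbf"               # the sequence from the error message

print(is_valid_utf8(good), is_valid_utf8(bad))
```

Under Python 2.x the same decode call accepts the second sequence, so this test is only meaningful on Python 3.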

0

chryss Aug 15 '10 at 13:04


In the end, I decided to just work around this: catch the error and roll back the transaction using Django's transaction management. I am still puzzled as to why this happens, though ...

0

Stavros korokithakis Aug 15 '10 at 13:29

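
To solve my similar problems with Django and Postgres, I am now doing something like this:

 import base64

 from django.db import models


 class SafeTextField(models.TextField):
     # Python 2 code: base64.encodestring/decodestring were removed in Python 3.9.
     def get_prep_value(self, value):
         # Store the text base64-encoded so the database only sees ASCII.
         encoded = base64.encodestring(value).strip()
         return super(SafeTextField, self).get_prep_value(encoded)

     def to_python(self, value):
         # Turn the stored base64 back into the original text.
         decoded = base64.decodestring(value)
         return super(SafeTextField, self).to_python(decoded)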

0

thanos Aug 7 '12 at 18:02


mikelikespie · Accepted Answer · 2010-08-18T09:51:07+0000

There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, the same bug exists in OS X's iconv (but not in glibc's).

Here's what happens:

Python 2.x does not recognize UTF-8 encoded surrogates as invalid [1] (and that is what your byte sequence is).

This should be all that is needed:

 foo.decode('utf8').encode('utf8')

But because of this bug, which they are not fixing, the round trip does not catch surrogates.
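To see why the bytes from the error message count as a surrogate, the three-byte sequence can be decoded by hand. This is only an illustrative sketch; the function name decode_three_byte_utf8 is hypothetical, and the bit layout (1110xxxx 10xxxxxx 10xxxxxx) is the three-byte form described in RFC 3629 [1].

```python
def decode_three_byte_utf8(b0, b1, b2):
    """Combine the payload bits of a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# The bytes reported by PostgreSQL: 0xedbdbf
code_point = decode_three_byte_utf8(0xED, 0xBD, 0xBF)

# U+D800..U+DFFF is reserved for UTF-16 surrogates and must not appear in UTF-8.
is_surrogate = 0xD800 <= code_point <= 0xDFFF

print(hex(code_point), is_surrogate)  # 0xdf7f True
```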

Try this in python 2.x and then in 3.x:

 b'\xed\xbd\xbf'.decode('utf8')

It raises an error (correctly) in the latter. They are not going to fix it in the 2.x branch either. See [2] and [3] for more information.

[1] http://tools.ietf.org/html/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209
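One possible way to check a string before it reaches the database is to scan it for surrogate code points directly instead of relying on the codec. This is a minimal sketch, assuming Python 3 (or a wide build of Python 2); on a narrow Python 2 build, characters outside the BMP are stored as surrogate pairs and would be flagged too. The names find_surrogates and is_storable are hypothetical.

```python
def find_surrogates(text):
    """Return the indexes of code points in the range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_storable(text):
    """True if text contains no surrogate code points."""
    return not find_surrogates(text)


clean = "abc"
dirty = "abc\udf7f"  # the code point behind the bytes 0xed 0xbd 0xbf

print(is_storable(clean), is_storable(dirty), find_surrogates(dirty))
```

Rejecting or stripping such strings up front avoids having to roll back the transaction after the database has already refused the bytes.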
