How can I check a Python unicode string to make sure that it *is actually* valid Unicode?

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Something about it seems to be messed up: it decodes correctly, but when I try to save it in Postgres, I get:

DatabaseError: invalid byte sequence for encoding "UTF8": 0xedbdbf 

The database connection chokes after that and refuses to do anything until I roll back, which would be a little complicated (long story). Is there a way to check for this before it reaches the database? source.encode("utf-8") works without raising an error, so I'm not sure what is happening...

+7
python postgresql unicode
5 answers

There is a bug in Python 2.x that is fixed only in Python 3.x. In fact, the same bug is present in OS X's iconv (but not in glibc's).

Here's what happens:

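Python 2.x does not reject UTF-8-encoded surrogates as invalid [1], and that is what your byte sequence is: 0xED 0xBD 0xBF is the encoding of U+DF7F, a lone low surrogate.

Normally this round trip should be all that is needed to validate the data: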

 foo.decode('utf8').encode('utf8') 

But because of this bug, it does not help: the round trip does not catch encoded surrogates.

Try this in Python 2.x and then in 3.x:

 b'\xed\xbd\xbf'.decode('utf8') 

It raises an error (correctly) only in the latter. The fix will not be backported to the 2.x branch either. See [2] and [3] for more information.
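One way to catch this before the database does is to scan the raw bytes yourself. The following is only a sketch, assuming the data is still a byte string at that point; has_encoded_surrogate is a hypothetical helper name, not part of Python or of any database driver. It relies on the fact that UTF-8 would encode U+D800..U+DFFF as the bytes ED A0..BF 80..BF, and that ED never appears as a continuation byte.

```python
import re

# Surrogates U+D800..U+DFFF would be encoded as ED A0..BF 80..BF.
# RFC 3629 forbids these sequences, but Python 2.x decodes them silently.
_ENCODED_SURROGATE = re.compile(b'\xed[\xa0-\xbf][\x80-\xbf]')


def has_encoded_surrogate(data):
    """Return True if the byte string contains a UTF-8-encoded surrogate.

    Hypothetical helper: it detects only this one class of invalid
    sequence, so it complements data.decode('utf8') rather than
    replacing it.
    """
    return _ENCODED_SURROGATE.search(data) is not None
```

Calling has_encoded_surrogate(b'\xed\xbd\xbf') flags the sequence from the error message, while valid text such as U+D7FF (encoded as ED 9F BF) passes. Because it works on bytes, the check behaves the same on Python 2.x and 3.x.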

[1] http://tools.ietf.org/html/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209

+9

A Python unicode object is a sequence of Unicode code points and is, by definition, proper Unicode. A Python str is a sequence of bytes, which may be Unicode characters encoded with a specific encoding (UTF-8, Latin-1, Big5, ...).

First question: is source a unicode object or a str? The fact that source.encode("utf-8") works only means that you can convert source to a UTF-8 encoded string, but are you doing that before passing it to the database functions? The database seems to expect its input to be encoded in UTF-8, and it complains when it does the equivalent of source.decode("utf-8").

If source is a unicode object, it must be encoded in UTF-8 before passing it to the database:

 source = u'abc'
 call_db(source.encode('utf-8'))

If source is a str encoded as something other than UTF-8, you must decode it from that encoding and then encode the resulting unicode object as UTF-8:

 source = 'abc'
 call_db(source.decode('Big5').encode('utf-8'))
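
The two cases can be folded into a single helper. This is a minimal sketch, assuming the caller knows the encoding of any byte string it passes in; to_utf8_bytes and its source_encoding parameter are hypothetical names used only for illustration.

```python
def to_utf8_bytes(source, source_encoding='utf-8'):
    """Return source as UTF-8 encoded bytes.

    Hypothetical helper: byte strings are first decoded from
    source_encoding, text (unicode) objects are encoded directly.
    """
    if isinstance(source, bytes):
        # In Python 2.x, bytes is an alias for str, so this branch
        # handles the "str in some other encoding" case.
        source = source.decode(source_encoding)
    return source.encode('utf-8')
```

For example, to_utf8_bytes(u'abc') covers the first case, and to_utf8_bytes(big5_data, 'big5') covers the second. Note that on Python 2.x the decode step is still subject to the surrogate issue described in the other answer.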
+1

What exactly are you doing? The content really does decode fine as UTF-8:

 >>> import urllib
 >>> webcontent = urllib.urlopen("http://hub.iis.sinica.edu.tw/cytoHubba/").read()
 >>> unicodecontent = webcontent.decode("utf-8")
 >>> type(webcontent)
 <type 'str'>
 >>> type(unicodecontent)
 <type 'unicode'>
 >>> type(unicodecontent.encode("utf-8"))
 <type 'str'>

Make sure you understand the difference between unicode strings and UTF-8 encoded strings. What you need to send to the database is unicodecontent.encode("utf-8") (this is the same as webcontent, but you have decoded it to make sure that there are no invalid byte sequences in your source).

I would second what WoLpH said: check the encoding settings of the database and of the database connection.

0

In the end, I decided to just work around this: I catch the error and roll back the transaction using Django's transaction management. I am still puzzled as to why this happens, though...

0

To solve my similar problems with Django / PostgreSQL, I am now doing something like this:

 import base64
 from django.db import models

 class SafeTextField(models.TextField):
     def get_prep_value(self, value):
         # Store base64 text so the database only ever sees plain ASCII
         encoded = base64.encodestring(value).strip()
         return super(SafeTextField, self).get_prep_value(encoded)

     def to_python(self, value):
         # Reverse the base64 encoding when loading the value
         decoded = base64.decodestring(value)
         return super(SafeTextField, self).to_python(decoded)
0