Best Practices for Python UnicodeDecodeError

I use the Pylons framework with Mako templates for a web application. I never paid much attention to how Python handles unicode strings. I had a tense moment when my site crashed while rendering a page, and later I found out that this was due to a UnicodeDecodeError.

After seeing the error, I started sprinkling encode and decode calls with the 'ignore' parameter around my Python code, but the errors still did not always go away.

In the end I resorted to decoding as ASCII with 'ignore', which got the site running without crashes.

Input to my site comes from many sources. This means that I do not control the language of the input. My site supports international languages along with English. I have feed aggregation that usually doesn't bother with unicode / ascii / utf-8. When I show the text through the Mako template, I show it as it is.

Not being a web expert, what are the best practices for handling strings in a Python project? Should I care only when rendering the text, or in every phase of the application?

+4
2 answers

If you have any influence over this, here is the painless way:

  • know your input encoding (or decode with 'ignore') and decode(encoding) the data as soon as it enters your application.
  • work only with unicode ( u'something' is unicode), including in the database
  • for rendering, export, etc., any time data leaves your application, encode('utf-8') it
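
The three steps above could be sketched roughly as follows. This is a minimal illustration written for Python 3 (where str is the unicode type), not the Pylons or Mako API; the function names read_input and render are hypothetical, and the choice of latin-1 for the incoming bytes is just an assumed example.

```python
def read_input(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes at the application boundary (hypothetical helper).

    errors="ignore" silently drops bytes that cannot be decoded,
    which avoids UnicodeDecodeError at the cost of losing data.
    """
    return raw.decode(encoding, errors="ignore")


def render(text: str) -> bytes:
    """Encode text as UTF-8 at the point where it leaves the application."""
    return text.encode("utf-8")


# Bytes arrive from outside in a known encoding (assumed latin-1 here).
incoming = "café".encode("latin-1")

# 1. Decode as early as possible.
text = read_input(incoming, "latin-1")

# 2. Work only with unicode text inside the application.
shout = text.upper()

# 3. Encode only on the way out.
outgoing = render(shout)
```

Decoding with the correct encoding is preferable whenever it is known; 'ignore' is only the fallback that keeps the page from crashing.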
+10

This may not be a viable option for you, but let me say that a great many encoding errors disappear when using Python 3, simply because the separation between unicode strings and bytes objects is made so much clearer. When I have to use Python 2, I choose version 2.6, where you can declare from __future__ import unicode_literals. Skeptics should actually read the link you posted, as it points out some subtleties of Python's en-/decoding behavior which, fortunately, are gone in Python 3.
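
To illustrate the separation mentioned here, a small sketch (Python 3 only, with made-up sample text): text and bytes are distinct types, and mixing them fails immediately with a TypeError instead of triggering an implicit ASCII decode somewhere later.

```python
text = "bløgger"             # str: a sequence of unicode code points
data = text.encode("utf-8")  # bytes: the encoded form

error = None
try:
    # Python 3 refuses to mix the two types implicitly.
    combined = text + data
except TypeError as exc:
    error = type(exc).__name__

# Converting explicitly is the only way to get back to text.
roundtrip = data.decode("utf-8")
```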

You say:

I do not control the language of the input. My site supports international languages along with English. I have feed aggregation that usually doesn't bother with unicode / ascii / utf-8.

Well, whatever you choose, it's clear that you don't want your web application to fail just because some dånish bløgger whose feeds you consume decided to encode their posts in an obscure Scandinavian encoding scheme. The underlying problem affects all web applications, because URLs carry no encoding information, and because you never know what sequence of bytes a malicious user might send you. In this case, I do what I call "safe chain decoding": first I try to decode as utf-8, and if that doesn't work, I try again using cp1252. If that fails too, I reject the request (HTTP 404 or something like that).
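
A minimal sketch of that "safe chain decoding" idea, written for Python 3. The function name decode_chain and the convention of returning None on failure are assumptions made for illustration; how the caller rejects the request is left to the application.

```python
def decode_chain(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order and return the first successful decode.

    Returns None when every encoding fails, so the caller can reject
    the request (hypothetical convention, not a library API).
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


from_utf8 = decode_chain("bløgger".encode("utf-8"))
from_cp1252 = decode_chain("bløgger".encode("cp1252"))

# 0xff is never valid in UTF-8 and 0x81 is undefined in Python's cp1252 codec.
rejected = decode_chain(b"\xff\x81")
```

The order matters: utf-8 is strict enough that bytes in another encoding rarely decode by accident, whereas cp1252 accepts almost any byte sequence, so it only makes sense as the last link in the chain.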

You mention that you aggregate feeds and that you (or the feeds?) do not bother with Unicode and encodings. Could you clarify that statement? It completely escapes me how one could successfully run a website that carries text in several languages without caring about encodings. Sticking explicitly to ASCII only will not get you very far.

+2