Python bulletproof encoding

A question about Unicode in Python 2.

As far as I understand, I must always decode everything that I read from the outside (files, network). decode converts external bytes to internal Python strings using the character set given as its parameter. Thus, decode("utf8") means that the external bytes are a unicode string and they will be decoded into Python strings.

Also, I should always encode everything that I write to the outside. I pass the target encoding as the parameter of encode, and it converts the string to that encoding before it is written.

These statements are correct, right?
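
To make the pattern concrete, here is a minimal sketch of "decode on input, encode on output". The byte string is made up for illustration and is assumed to be UTF-8; in Python 2 the decoded value is a `unicode` object, in Python 3 a `str`.

```python
# Bytes as they might arrive from a file or socket (assumed to be UTF-8).
raw_in = b"caf\xc3\xa9"

# Decode at the boundary: bytes -> text.
text = raw_in.decode("utf-8")

# Work with text internally.
shouted = text.upper()

# Encode at the boundary: text -> bytes, ready to be written out.
raw_out = shouted.encode("utf-8")
```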

But sometimes, when I parse HTML documents, I get decoding errors. As I understand it, the document is in a different encoding (for example, cp1252), and the error occurs when I try to decode it as utf8. So the question is: how do I write a bulletproof application?
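
The failure described here can be reproduced with a made-up cp1252 byte string. This is only a sketch of the symptom; the sample bytes are an assumption, not taken from a real document.

```python
# "café" encoded as cp1252: the e-acute is the single byte 0xE9.
raw = b"caf\xe9"

try:
    text = raw.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte UTF-8 sequence that never completes.
    failed = True
    text = raw.decode("cp1252")
```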

I found that there is a good library for guessing the encoding, chardet, and that this is the only way to write bulletproof applications. Is that right?
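
A minimal sketch of how chardet could be used for this, assuming the third-party `chardet` package may or may not be installed. The helper name `decode_with_guess` and its `default` parameter are hypothetical. `chardet.detect` returns a dict whose `"encoding"` entry can be `None`, and the guess is a heuristic, so undecodable bytes are replaced rather than allowed to raise.

```python
def decode_with_guess(raw, default="utf-8"):
    """Decode bytes using chardet's guess, falling back to `default`."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    encoding = None
    if chardet is not None:
        # detect() returns e.g. {"encoding": "ascii", "confidence": 1.0, ...}
        encoding = chardet.detect(raw).get("encoding")

    try:
        return raw.decode(encoding or default, "replace")
    except LookupError:
        # The guessed name is not a codec Python knows about.
        return raw.decode(default, "replace")
```

Because the guess can be wrong, the result may still contain the wrong characters even though no exception is raised.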

+4
3 answers

... decode("utf8") means that the external bytes are a unicode string and they will be decoded into Python strings.

...

These statements are correct, right?

, , unicode. , <str>.decode("utf8") Python unicode, <str> UTF-8; , UTF-8.

. - , , - , . , , HTML , , , HTML, , ( ). , HTML- , , . , .

, chardet ( ), ( , ). , , .

+1

try: except: calls.

  • utf-8:
  • , utf-8:
  • , :
  • .. ..
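
A chain of try/except decode attempts that starts with UTF-8 could be sketched as follows. The function name, the candidate list, and the final latin-1 fallback are assumptions made for illustration; they are not taken from the answer.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    # latin-1 maps every byte to a code point, so this never raises,
    # but the resulting text may not be what the author intended.
    return raw.decode("latin-1"), "latin-1"
```

For example, `decode_with_fallbacks(b"caf\xe9")` fails as UTF-8 and succeeds as cp1252. Order matters: a strict encoding such as UTF-8 should come first, because permissive single-byte encodings accept almost any input.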

, str, ( ) , , None , , .

, -, .

, , - , , , , .

0

Convert to unicode using cp437. That way your bytes always convert to Unicode and back again.
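
A small sketch of why this never raises: Python's cp437 codec assigns a character to every byte value, so any byte string decodes and re-encodes to the same bytes. The variable names are illustrative.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(bytearray(range(256)))

# Decoding cannot fail, and encoding restores the original bytes.
as_text = all_bytes.decode("cp437")
round_tripped = as_text.encode("cp437")
```

This guarantees a lossless round trip, not correct text: UTF-8 or cp1252 input decoded this way will display as the wrong characters.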

-1