I don't understand encoding and decoding in Python (2.7.3)

I tried to work out encoding and decoding in Python on my own, but nothing is really clear to me.

  • str.encode([encoding,[errors]])
  • str.decode([encoding,[errors]])

First, I don't understand the need for the "encoding" parameter in these two functions.

What is the output of each function, and what is its encoding? What is the "encoding" parameter used for in each function? I really don't understand the definition of a "byte string".

I have an important question: is there a way to convert from one encoding to another? I read some text on ASN.1 about the "octet string", so I wondered whether it is the same thing as a "byte string".

Thanks for the help.

+8
python string unicode decode encode
4 answers

This is a bit more complicated in Python 2 (compared to Python 3), since it largely conflates the concepts of "string" and "byte string", but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, you need to understand that "string" and "character" are abstract concepts that a computer cannot represent directly. A byte string is a raw stream of bytes straight from disk (or one that can be written straight to disk). encode goes from abstract to concrete (you give it, preferably, a Unicode string, and it returns a byte string); decode goes the other way.

An encoding is a rule that says, for example, that "a" is represented by the byte 0x61 and "α" by the two-byte sequence 0xCE 0xB1 (these are the UTF-8 values).
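As a small illustration of "an encoding is a rule", the sketch below encodes the same two characters under UTF-8 and then encodes "α" under a second rule. The choice of ISO 8859-7 as the second encoding is mine, not the answer's, and the u'' and b'' literals are used so the snippet should run on Python 2.7 as well as 3.3+.

```python
# Sketch: one character, different byte sequences under different rules.
a_bytes = u'a'.encode('utf-8')
alpha_utf8 = u'\u03b1'.encode('utf-8')       # GREEK SMALL LETTER ALPHA

assert a_bytes == b'\x61'                    # one byte, same as ASCII
assert alpha_utf8 == b'\xce\xb1'             # two bytes in UTF-8

# A different encoding is a different rule (ISO 8859-7 is an assumed
# example here): the same character becomes a single byte.
alpha_greek = u'\u03b1'.encode('iso8859_7')
assert alpha_greek == b'\xe1'
```

The abstract character is the same in both cases; only the bytes chosen to represent it differ.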

+19

My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop the Pain?, covers all of these details.

In short, Unicode strings are sequences of integers called code points, and byte strings are sequences of bytes. An encoding is a way of representing Unicode code points as a series of bytes. Thus, unicode_string.encode(enc) returns the byte string that results from encoding the Unicode string with "enc", and byte_string.decode(enc) returns the Unicode string that results from decoding the byte string with "enc".
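This also suggests one way to answer the "move from one encoding to another" part of the question: decode with the source encoding, then encode with the target. The helper name transcode below is hypothetical, and the sample bytes are an assumption chosen for illustration; treat it as a sketch rather than a standard API.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going through Unicode (hypothetical helper)."""
    text = data.decode(source_encoding)     # bytes -> Unicode code points
    return text.encode(target_encoding)     # code points -> bytes again


latin1_bytes = b'fran\xe7aise'              # "française" encoded as latin-1
utf8_bytes = transcode(latin1_bytes, 'latin-1', 'utf-8')

assert utf8_bytes == b'fran\xc3\xa7aise'    # the "ç" is now two bytes
```

Note that you have to know the source encoding; the bytes themselves do not say which rule produced them.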

+14

Python 2.x has two types of strings:

  • str = "byte strings" = sequences of octets. They are used both for "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (for example, the output of struct.pack).
  • unicode = "Unicode strings" = sequences of UTF-16 or UTF-32 code units, depending on how Python was built.

This model has been modified for Python 3.x :

  • 2.x unicode became 3.x str (and the u prefix was dropped from literals; Python 3.3 and later accept it again for compatibility).
  • The bytes type was introduced to represent binary data.
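A quick way to see the two models side by side is to check the type names on whichever interpreter you are running. This is only a sketch; the variable names are mine.

```python
import sys

text = u'\u20ac'              # EURO SIGN as text
data = text.encode('utf-8')   # the same character as bytes

text_type = type(text).__name__
data_type = type(data).__name__

if sys.version_info[0] >= 3:
    expected = ('str', 'bytes')       # Python 3 naming
else:
    expected = ('unicode', 'str')     # Python 2 naming

assert (text_type, data_type) == expected
```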

A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string to a byte string, use the encode method:

 >>> u'\u20AC'.encode('UTF-8')
 '\xe2\x82\xac'

To convert the other way, use the decode method:

 >>> '\xE2\x82\xAC'.decode('UTF-8')
 u'\u20ac'
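The question also lists an errors argument. As a hedged sketch of what it controls, the snippet below encodes the same euro sign to ASCII, which cannot represent it, under a few of the standard error handlers. The variable names are mine.

```python
euro = u'\u20ac'  # EURO SIGN, which ASCII cannot represent

# The default, errors='strict', raises on an unencodable character.
try:
    euro.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = euro.encode('ascii', 'replace')           # substitutes '?'
ignored = euro.encode('ascii', 'ignore')             # drops the character
as_xml = euro.encode('ascii', 'xmlcharrefreplace')   # numeric XML reference

assert strict_failed
assert replaced == b'?'
assert ignored == b''
assert as_xml == b'&#8364;'
```

Everything except 'strict' loses or alters information, so those handlers are best kept for output you are willing to degrade.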
+6

Yes, a byte string is an octet string. Encoding and decoding happen whenever text is input or output (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server may serve latin-1, and some file formats require odd encodings, such as BibTeX's accent escapes: fran\c{c}aise. You need to convert from/to these at the input/output boundary.

The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).
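One way to picture the conversion at the I/O boundary is a file written in one encoding and read back. The sketch below uses io.open, which takes an encoding argument on both Python 2.7 and 3; the temporary file and the choice of latin-1 are assumptions made for the example.

```python
import io
import os
import tempfile

text = u'fran\xe7aise'   # Unicode text, with "ç" as code point U+00E7

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Writing in text mode encodes behind the scenes.
    with io.open(path, 'w', encoding='latin-1') as f:
        f.write(text)

    # Reading in binary mode shows the raw bytes that hit the disk.
    with io.open(path, 'rb') as f:
        raw = f.read()

    # Reading in text mode decodes behind the scenes.
    with io.open(path, 'r', encoding='latin-1') as f:
        round_trip = f.read()
finally:
    os.remove(path)

assert raw == b'fran\xe7aise'
assert round_trip == text
```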

+4
