How to add a new encoding in Python 2.6?

Yet another encoding problem: I'm dealing with an IBM mainframe that uses the IBM870 encoding, which is not supported by Python (or by much else, for that matter).

Fortunately, a gifted coder has hacked together a script that generates the appropriate encoding definitions for Python from the character lists available at FileFormat.info.

List Used: IBM870 Character List

The generated encoding can be seen here: cp870.py

The system is RHEL 6.3, running Python 2.6:

 Python 2.6.6 (r266:84292, Aug 28 2012, 10:55:56)
 [GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2

cp870.py is placed in:

 /usr/lib64/python2.6/encodings/ 

The following entries have been added to /usr/lib64/python2.6/encodings/aliases.py:

 # cp870 codec
 '870' : 'cp870',
 'csibm870' : 'cp870',
 'ibm870' : 'cp870',
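
Editing aliases.py only maps the extra names onto cp870; it does not show that the codec module itself can be imported. As a minimal sketch, codecs.lookup() can be used to check that a name resolves to a loadable codec. The helper name resolve_codec is hypothetical, and cp500 (an EBCDIC codec that ships with Python) is used only as a stand-in, since cp870 is not assumed to be installed:

```python
import codecs


def resolve_codec(name):
    """Return the canonical codec name for `name`, or None if no codec loads."""
    try:
        # codecs.lookup() applies the alias table and imports the codec module.
        return codecs.lookup(name).name
    except LookupError:
        return None


# 'ibm500' is an alias from the standard alias table (stand-in example).
print(resolve_codec('ibm500'))
# On the system described above, the same check could be run with 'ibm870'.
print(resolve_codec('ibm870'))
```

A result of None for 'ibm870' would point at the codec module or its location rather than at the alias entries.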

The aliases are picked up correctly, as shown here (thanks to this answer):

 >>> from encodings.aliases import aliases
 >>> def find(q):
 ...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
 ...
 >>> find('870')
 [('ibm870', 'cp870'), ('870', 'cp870'), ('csibm870', 'cp870')]
 >>> find('cp870')
 [('ibm870', 'cp870'), ('870', 'cp870'), ('csibm870', 'cp870')]
 >>> find('ibm870')
 [('ibm870', 'cp870'), ('csibm870', 'cp870')]

When I tried to encode() some characters, it did not work as planned:

 >>> 'c'.encode('cp870')
 '\x83'
 >>> 'č'.encode('cp870')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.6/encodings/cp870.py", line 12, in encode
     return codecs.charmap_encode(input,errors,encoding_table)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

This is what '\x83' should be according to cp870.py:

 u'\x83' # 0x23 -> NO BREAK HERE (U+0083) 

As I'm just starting out with Python, can someone tell me what else is needed for Python to load and use this encoding correctly?

1 answer

In Python 2.x, Unicode string literals must be prefixed with u or U. Unprefixed literals are byte strings, in ASCII or some other 8-bit encoding.

In addition, Python expects byte strings to be ASCII-encoded whenever it has to convert them implicitly (although you can configure a different default encoding). So when you put a non-ASCII character in an unprefixed string literal and call encode() on it, the interpreter first tries to decode the bytes as ASCII, which causes the error you see.
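
The implicit step can be reproduced explicitly. The sketch below assumes the terminal sends UTF-8, in which 'č' (U+010D) is the two bytes 0xC4 0x8D; that is consistent with the byte 0xc4 reported in the traceback. The helper name ascii_decode_error is hypothetical:

```python
# Bytes that an unprefixed 'č' literal holds on a UTF-8 terminal (assumption).
raw = b'\xc4\x8d'


def ascii_decode_error(data):
    """Return the error message from decoding `data` as ASCII, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# This is the decode that Python 2 attempts before it can encode a byte string.
print(ascii_decode_error(raw))

# Decoding with the encoding the bytes are really in yields the character.
text = raw.decode('utf-8')
print(repr(text))
```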

So you need to add the u prefix and use an escape sequence to specify the character:

 u'\x83'.encode('cp870')
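
To illustrate the difference between the letter and the escape, here is a sketch using cp500, an EBCDIC codec that ships with Python, as a stand-in for cp870. It assumes that cp870 shares these two positions with cp500, which the table line quoted in the question suggests but which is not verified here:

```python
# cp500 is a stand-in: it ships with Python, whereas cp870 is a custom module.
CODEC = 'cp500'

letter = u'c'        # LATIN SMALL LETTER C (U+0063)
control = u'\x83'    # the control character U+0083, not the letter

letter_bytes = letter.encode(CODEC)
control_bytes = control.encode(CODEC)

print(repr(letter_bytes))
print(repr(control_bytes))

# Both values survive a round trip through the codec.
assert letter_bytes.decode(CODEC) == letter
assert control_bytes.decode(CODEC) == control
```

In cp500 the letter c encodes to the byte 0x83, while u'\x83' is a different character and encodes to the byte 0x23. Under the stated assumption, 'c'.encode('cp870') returning '\x83' is therefore the expected result rather than a fault.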