Convert or cut "illegal" Unicode characters

Question

I have a database in MSSQL that I am migrating to SQLite / Django. I am using pymssql to connect to the source database and storing a text field in a local SQLite database.

However, it blows up on some characters. I get errors like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)

Is it possible to convert these characters to proper Unicode somehow? Or to strip them out?

+7

python unicode pymssql

Oli Mar 24 '10 at 15:14

2 answers

When you decode, just pass "ignore" as the error handler to strip these characters.

There are other error handlers for removing or converting them:

 'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
 'ignore': ignore malformed data and continue without further notice
 'backslashreplace': replace with backslashed escape sequences (for encoding only)

Test

 >>> "abcd\x97".decode("ascii")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
 >>>
 >>> "abcd\x97".decode("ascii","ignore")
 u'abcd'
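To see how the handlers differ on the same input, here is a minimal sketch. It uses a `b"..."` literal so that it also runs on Python 3, where only byte strings have `.decode()`. The variable names are illustrative only.

```python
# Byte string containing 0x97, which is outside the ASCII range.
raw = b"abcd\x97"

# 'ignore' silently drops the byte that cannot be decoded.
ignored = raw.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD (the Unicode replacement character)
# for each undecodable byte, so the loss stays visible in the result.
replaced = raw.decode("ascii", "replace")

print(repr(ignored))
print(repr(replaced))
```

Either handler loses the original character, so they are best treated as a fallback when the real encoding cannot be determined.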

+11

YOU Mar 24 '10 at 15:18

Alex Martelli · Accepted Answer · 2010-03-24T15:22:13+0000

As soon as you have a byte string s, instead of using it directly as a unicode object, explicitly decode it with the right codec, for example:

 u = s.decode('latin-1')

and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). This assumes latin-1 is the encoding that was used to produce the original byte string. We can't guess that for you, so try to find out ;-).
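The choice of codec matters for the byte in the question. The sketch below decodes the same bytes two ways. Note that cp1252 is only a guess at what a Windows-hosted database might have used; the question does not say which encoding applies, and the sample bytes are made up.

```python
# Sample bytes containing 0x97, the byte from the error message.
raw = b"wait\x97what"

# latin-1 maps every byte to the code point with the same number,
# so it never fails, but 0x97 becomes the control character U+0097.
as_latin1 = raw.decode("latin-1")

# cp1252 (assumed here, not confirmed by the question) maps 0x97
# to U+2014, the em dash.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

If the stored text shows dashes or curly quotes after decoding with cp1252 but invisible control characters with latin-1, that is a hint about which one the source data really used.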

As a rule, I suggest not processing text as encoded byte strings anywhere in your application: decode to unicode objects immediately after input and, if necessary, encode back to byte strings immediately before output.
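A rough sketch of that rule applied to the migration, written for Python 3 and using an in-memory SQLite database. The `to_text` helper, the `notes` table, and the `SOURCE_ENCODING` value are all hypothetical; in particular, cp1252 is an assumption and should be replaced with whatever encoding the MSSQL data actually uses.

```python
import sqlite3

# Assumption: the source data is cp1252. Verify this for your database.
SOURCE_ENCODING = "cp1252"

def to_text(value, encoding=SOURCE_ENCODING):
    """Decode byte strings to text at the input boundary.

    Values that are not byte strings are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# Stand-in for a value fetched from the source database.
raw = b"first\x97second"

# Decode once on the way in; everything after this point handles text.
conn.execute("INSERT INTO notes (body) VALUES (?)", (to_text(raw),))

stored = conn.execute("SELECT body FROM notes").fetchone()[0]

# Encode only at the output boundary, for example when writing a file.
encoded_out = stored.encode("utf-8")
print(repr(stored))
```

Keeping the decode step in one small function means there is a single place to change once the real source encoding is known.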
