Convert or cut "illegal" Unicode characters

I have a database in MSSQL that I am migrating to SQLite / Django. I am using pymssql to connect to the MSSQL database and copy a text field into a local SQLite database.

However, it fails on some characters. I get errors like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128) 

Is it possible to convert these characters to the correct Unicode representation somehow? Or strip them out?

python unicode pymssql
2 answers

As soon as you have a byte string s, instead of using it directly as a unicode object, explicitly decode it with the right codec, for example:

 u = s.decode('latin-1') 

and use u instead of s in the code that follows (presumably the part that writes to SQLite). This assumes latin-1 is the encoding that was used to produce the original byte string; we cannot guess, so try to find out ;-).
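Why the codec choice matters for the byte in the traceback can be seen in a small sketch. It assumes Python 3 (where a byte string is `bytes` and the decoded result is `str`), and the sample value is made up. `cp1252` is only a guess at what a Windows-hosted SQL Server might be using; the question does not confirm it.

```python
# Hypothetical sample value containing 0x97, the byte from the traceback.
s = b"abcd\x97efgh"

# latin-1 never fails: every byte maps straight to U+0000..U+00FF,
# so 0x97 becomes the invisible control character U+0097.
as_latin1 = s.decode("latin-1")

# Windows-1252 (assumed here, not confirmed by the question) maps
# 0x97 to U+2014, the em dash.
as_cp1252 = s.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Both calls succeed, but they produce different text, so a decode that does not raise is not proof that the codec is the right one.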

As a general rule: do not process text as encoded byte strings inside your application. Decode to unicode objects immediately on input and, if necessary, encode back to byte strings immediately before output.
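A minimal sketch of that rule, assuming Python 3 and the standard `sqlite3` module. The helper `to_text`, the `notes` table, the sample rows, and the `cp1252` default are all hypothetical stand-ins for whatever the real migration reads from MSSQL.

```python
import sqlite3


def to_text(raw, encoding="cp1252"):
    """Decode once, at the input boundary; pass text and None through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Hypothetical rows as they might arrive from the source database.
raw_rows = [(1, b"caf\xe9 \x97 bar"), (2, b"plain ascii")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

# Everything past this point works with text, never with encoded bytes.
conn.executemany(
    "INSERT INTO notes (id, body) VALUES (?, ?)",
    [(row_id, to_text(body)) for row_id, body in raw_rows],
)
conn.commit()

stored = [body for (body,) in conn.execute("SELECT body FROM notes ORDER BY id")]

# Encode only at the output boundary, e.g. when writing to a file or socket.
output_bytes = stored[0].encode("utf-8")
```

Keeping the decode in one helper means the encoding assumption lives in a single place and is easy to change once the real source encoding is known.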


When you decode, just pass "ignore" to strip these characters.

There are other options for removing or converting them:

 'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
 'ignore': ignore malformed data and continue without further notice
 'backslashreplace': replace with backslashed escape sequences (for encoding only)

Test

 >>> "abcd\x97".decode("ascii")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
 >>>
 >>> "abcd\x97".decode("ascii","ignore")
 u'abcd'
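The session above is Python 2. For comparison, a rough Python 3 equivalent is sketched below, assuming the data arrives as `bytes`. The variable names are illustrative only.

```python
data = b"abcd\x97"

# Strict decoding (the default) raises on the first undecodable byte.
try:
    data.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# "ignore" silently drops the undecodable byte.
ignored = data.decode("ascii", "ignore")

# "replace" substitutes U+FFFD (the replacement character) for it.
replaced = data.decode("ascii", "replace")

print(error_message)
print(repr(ignored))
print(repr(replaced))
```

Note that "ignore" discards data without any trace, while "replace" at least leaves a visible marker where something was lost.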