UnicodeDecodeError when loading with SQLAlchemy

I am querying a MySQL database with SQLAlchemy and get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 498-499: unexpected end of data 

The column in the table was defined as Unicode(500), so I take this error to mean that some entry was truncated because it was longer than 500 characters. Is there a way to handle this error and still load the record? Is there a way to find the erroneous entry and delete it, other than loading the entries one at a time (or in batches) until I hit the error?

python mysql unicode utf-8 sqlalchemy
3 answers

In short, you should change:

 Unicode(500) 

to

 Unicode(500, unicode_error='ignore', convert_unicode='force')

(The code shown is Python 2, but the principles carry over to Python 3; only some of the details differ.)

What happens is that the byte string coming back from the database is decoded, and the decoder complains that it cannot decode it, with the error you saw.

 >>> u = u'ABCDEFGH\N{TRADE MARK SIGN}'
 >>> u
 u'ABCDEFGH\u2122'
 >>> print(u)
 ABCDEFGH™
 >>> s = u.encode('utf-8')
 >>> s
 'ABCDEFGH\xe2\x84\xa2'
 >>> truncated = s[:-1]
 >>> truncated
 'ABCDEFGH\xe2\x84'
 >>> truncated.decode('utf-8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/Users/cliffdyer/.virtualenvs/edx-platform/lib/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-9: unexpected end of data

Python provides several optional error-handling modes for decoding. Raising an exception is the default, but you can also drop the invalid bytes or convert the invalid part of the string to the official Unicode replacement character.

 >>> truncated.decode('utf-8', errors='replace')
 u'ABCDEFGH\ufffd'
 >>> truncated.decode('utf-8', errors='ignore')
 u'ABCDEFGH'

This is exactly what happens when the column value is processed.
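
For readers on Python 3, the same experiment might look like the sketch below. It is an illustration rather than part of the original session: the variable names are made up, and the only assumption is that `bytes.decode` accepts the same `errors` modes as shown above.

```python
# Python 3 sketch of the session above: bytes and str are distinct types.
text = "ABCDEFGH\N{TRADE MARK SIGN}"
encoded = text.encode("utf-8")   # the trade mark sign takes three bytes
truncated = encoded[:-1]         # cut the final character short

try:
    truncated.decode("utf-8")    # default errors="strict" raises
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Substitute U+FFFD for the incomplete character.
replaced = truncated.decode("utf-8", errors="replace")
# Silently drop the incomplete character.
ignored = truncated.decode("utf-8", errors="ignore")

print(strict_error)
print(repr(replaced))
print(repr(ignored))
```

The `replace` mode at least leaves a visible marker where data was lost, which may be preferable to `ignore` when you later want to find the damaged rows.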

Looking at the Unicode and String classes in sqlalchemy/sql/sqltypes.py, there appears to be a unicode_error argument that you can pass to the constructor, which hands its value through to the decoder's errors argument. There is also a note that you need to set convert_unicode='force' for it to take effect.

Thus, Unicode(500, unicode_error='ignore', convert_unicode='force') should solve your problem, provided you are OK with losing the truncated end of your data.

If you have some control over the database, you should be able to prevent this problem in the future by setting the database character set to utf8mb4. (Don't just use utf8, or it will fail on four-byte UTF-8 sequences, including most emoji.) Then you are guaranteed that valid UTF-8 is stored in, and returned from, your database.
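
To see why the choice between utf8 and utf8mb4 matters, here is a small sketch that measures how many bytes a few characters need in UTF-8. The sample characters are arbitrary picks for illustration; the point is that MySQL's legacy utf8 character set stores at most three bytes per character, so anything longer needs utf8mb4.

```python
# Byte length of individual characters once encoded as UTF-8.
samples = {
    "A": "A",
    "trade mark sign": "\N{TRADE MARK SIGN}",
    "grinning face": "\N{GRINNING FACE}",
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Characters that take more than three bytes do not fit MySQL's legacy utf8.
needs_utf8mb4 = [name for name, size in utf8_lengths.items() if size > 3]

for name, size in utf8_lengths.items():
    print(name, size)
print("needs utf8mb4:", needs_utf8mb4)
```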


Make the column you are storing into a BLOB. After loading the data, run some checks, such as

  SELECT MAX(LENGTH(col)) FROM ... -- to see what the longest is in _bytes_. 

Copy the data to another BLOB column and execute

  ALTER TABLE t MODIFY col2 TEXT CHARACTER SET utf8 ... -- to see if it converts correctly 

If it succeeds, then do

  SELECT MAX(CHAR_LENGTH(col2)) ... -- to see if the longest is more than 500 _characters_. 

After you have tried several of these things, we can see which direction to take next.
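
Once the values are available as raw bytes (for example from the BLOB copy), the offending rows can also be located on the Python side. The following is a minimal sketch: `find_undecodable` and the sample rows are hypothetical names and data, and in practice `rows` would come from a query selecting the primary key and the BLOB column.

```python
def find_undecodable(rows, encoding="utf-8"):
    """Return (row_id, error) pairs for values that do not decode cleanly.

    `rows` is any iterable of (row_id, raw_bytes) pairs.
    """
    bad = []
    for row_id, raw in rows:
        if raw is None:          # NULL columns have nothing to decode
            continue
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            bad.append((row_id, exc))
    return bad


# Hypothetical sample data standing in for a query result.
sample_rows = [
    (1, "ok".encode("utf-8")),
    (2, "ABCDEFGH\N{TRADE MARK SIGN}".encode("utf-8")[:-1]),  # cut mid-character
    (3, None),
]

bad_rows = find_undecodable(sample_rows)
for row_id, exc in bad_rows:
    print(row_id, exc.reason)
```

The collected ids can then be used to inspect, repair, or delete the damaged records in one pass instead of loading entries one by one until something fails.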


In short, your MySQL setup is incorrect in that it truncates UTF-8 characters in the middle of a byte sequence. I would double-check that MySQL really expects UTF-8 as the character encoding, both in the sessions and in the tables themselves.


I would suggest switching to PostgreSQL (seriously) to avoid this kind of problem: not only does PostgreSQL understand UTF-8 correctly in its default configuration, it also never truncates a string to make the value fit, choosing to raise an error instead:

 psql (9.5.3, server 9.5.3)
 Type "help" for help.

 testdb=> create table foo(bar varchar(4));
 CREATE TABLE
 testdb=> insert into foo values ('aaaaa');
 ERROR: value too long for type character varying(4)

Silently truncating data also goes against the Zen of Python:

Explicit is better than implicit.

and

Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
