In short, you should change:

    Unicode(500)

to

    Unicode(500, unicode_errors='ignore', convert_unicode='force')
(Python 2 code is shown, but the principles carry over to Python 3; only some details differ.)
What happens is that when the byte string is decoded, the bytes turn out not to be valid UTF-8, and decoding fails with the error you saw.
    >>> u = u'ABCDEFGH\N{TRADE MARK SIGN}'
    >>> u
    u'ABCDEFGH\u2122'
    >>> print(u)
    ABCDEFGH™
    >>> s = u.encode('utf-8')
    >>> s
    'ABCDEFGH\xe2\x84\xa2'
    >>> truncated = s[:-1]
    >>> truncated
    'ABCDEFGH\xe2\x84'
    >>> truncated.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/cliffdyer/.virtualenvs/edx-platform/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-9: unexpected end of data
Python provides several optional error-handling modes for decoding. Raising an exception is the default, but you can also drop the invalid bytes or convert them to the official Unicode replacement character (U+FFFD).
    >>> truncated.decode('utf-8', errors='replace')
    u'ABCDEFGH\ufffd'
    >>> truncated.decode('utf-8', errors='ignore')
    u'ABCDEFGH'
This is exactly what is happening during your column processing, when SQLAlchemy decodes the bytes fetched from the database.
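As a rough sketch of the idea (the function name and shape here are illustrative, not SQLAlchemy's actual internals), a Unicode column's result processor effectively decodes each fetched bytestring, forwarding an errors argument to the codec:

```python
# Illustrative sketch only -- the name process_result_value and its
# signature are made up for this example, not SQLAlchemy's API.
def process_result_value(raw, errors='strict'):
    """Decode a raw bytestring fetched from the DBAPI into text."""
    if raw is None:
        return None
    return raw.decode('utf-8', errors=errors)

# The trademark sign with its last UTF-8 byte cut off, as in the session above.
truncated = b'ABCDEFGH\xe2\x84'

print(repr(process_result_value(truncated, errors='replace')))  # 'ABCDEFGH\ufffd'
print(repr(process_result_value(truncated, errors='ignore')))   # 'ABCDEFGH'
```

With the default errors='strict', the same input raises UnicodeDecodeError, which is what you are seeing.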
Looking at the Unicode and String classes in sqlalchemy/sql/sqltypes.py, there is a unicode_errors argument you can pass to the constructor; its value is passed through as the codec's errors argument. There is also a note that you must set convert_unicode='force' for it to take effect.
Thus, Unicode(500, unicode_errors='ignore', convert_unicode='force') should solve your problem, if you are okay with the ends of your data being truncated.
If you have some control over the database, you should be able to prevent this problem in the future by setting your database's character set to utf8mb4. (Don't use plain utf8, or it will fail on four-byte UTF-8 characters, which include most emoji.) Then you will be guaranteed that valid UTF-8 is stored in and returned from your database.
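You can verify in Python why plain utf8 is not enough: characters outside the Basic Multilingual Plane, which is where most emoji live, take four bytes in UTF-8, while MySQL's legacy utf8 charset stores at most three bytes per character:

```python
# Characters above U+FFFF encode to four bytes in UTF-8; MySQL's legacy
# "utf8" charset only stores up to three bytes per character.
trademark = u'\N{TRADE MARK SIGN}'   # U+2122, inside the BMP
emoji = u'\N{GRINNING FACE}'         # U+1F600, outside the BMP

print(len(trademark.encode('utf-8')))  # 3 -- fits in MySQL utf8
print(len(emoji.encode('utf-8')))      # 4 -- needs utf8mb4
```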