I think the best answer (in Python 3) is to pass the errors= parameter to open():
with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()
Evidence:
>>> s = b'\xe5abc\nline2\nline3' >>> with open('evil_unicode.txt','wb') as f: ... f.write(s) ... 16 >>> with open('evil_unicode.txt', 'r') as f: ... lines = f.readlines() ... Traceback (most recent call last): File "<stdin>", line 2, in <module> File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte >>> with open('evil_unicode.txt', 'r', errors='replace') as f: ... lines = f.readlines() ... >>> lines [' abc\n', 'line2\n', 'line3'] >>>
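Note that errors= can be 'replace' or 'ignore' (among other handlers; the default behaves like 'strict', which raises the error shown above). With 'replace', each undecodable byte becomes U+FFFD, the Unicode replacement character; with 'ignore', it is silently dropped. Here is what 'ignore' looks like: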
>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
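If you want to try the handlers side by side without a REPL, here is a small self-contained sketch. It is only an illustration: the helper name read_lines is made up for this example, the file goes into a temporary directory, and encoding='utf-8' is passed explicitly (an assumption on my part, since the sessions above rely on the platform default, which happened to be UTF-8).

```python
import os
import tempfile

# Same bytes as in the sessions above: 0xe5 starts a multi-byte UTF-8
# sequence but is followed by 'a', so it cannot be decoded.
data = b'\xe5abc\nline2\nline3'


def read_lines(path, errors):
    """Hypothetical helper: read all lines using the given error handler."""
    # encoding is explicit here (an assumption of this sketch) so the
    # result does not depend on the platform's default encoding.
    with open(path, 'r', encoding='utf-8', errors=errors) as f:
        return f.readlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'evil_unicode.txt')
    with open(path, 'wb') as f:
        f.write(data)

    replaced = read_lines(path, 'replace')  # bad byte becomes U+FFFD
    ignored = read_lines(path, 'ignore')    # bad byte is dropped

    # 'strict' raises instead of producing text.
    try:
        read_lines(path, 'strict')
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

print(replaced)
print(ignored)
print(strict_failed)
```

Keep in mind that both 'replace' and 'ignore' lose information: once the file has been read this way, the original byte cannot be recovered from the resulting strings.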
caleb