I would borrow Python 3's tokenize.detect_encoding() function, slightly adjusted to fit Python 2. The signature is changed to accept a filename, and returning the lines read so far is dropped; you don't need them for this use case:
    import re
    from codecs import lookup, BOM_UTF8

    cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
    blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')

    def _get_normal_name(orig_enc):
        """Imitates get_normal_name in tokenizer.c."""
        # Only the first 12 characters of the cookie value matter.
        enc = orig_enc[:12].lower().replace("_", "-")
        if enc == "utf-8" or enc.startswith("utf-8-"):
            return "utf-8"
        if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
           enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
            return "iso-8859-1"
        return orig_enc
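The rest is detect_encoding() itself. The sketch below follows the body of Python 3's tokenize.detect_encoding(), adapted to take a filename and return only the encoding; find_cookie() is the inner helper that the demo tracebacks below refer to:

    def detect_encoding(filename):
        bom_found = False
        default = 'ascii'

        def find_cookie(line):
            # Look for a PEP 263 coding cookie on this line.
            match = cookie_re.match(line)
            if not match:
                return None
            encoding = _get_normal_name(match.group(1))
            try:
                lookup(encoding)
            except LookupError:
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    "unknown encoding for {!r}: {}".format(filename, encoding))
            if bom_found:
                if encoding != 'utf-8':
                    # This behaviour mimics the Python interpreter
                    raise SyntaxError(
                        "encoding problem for {!r}: utf-8".format(filename))
                encoding += '-sig'
            return encoding

        with open(filename, 'rb') as fileobj:
            first = fileobj.readline()
            if first.startswith(BOM_UTF8):
                bom_found = True
                first = first[3:]
                default = 'utf-8-sig'
            if not first:
                return default

            encoding = find_cookie(first)
            if encoding:
                return encoding
            if not blank_re.match(first):
                # Only look at the second line if the first is
                # blank or a comment without a cookie.
                return default

            second = fileobj.readline()
            if not second:
                return default

            return find_cookie(second) or default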
Like the original, the function reads at most two lines from the source file, and it raises a SyntaxError if the coding cookie names an unknown encoding, or names an encoding other than UTF-8 while a UTF-8 BOM is present.
Demo:
    >>> import tempfile
    >>> def test(contents):
    ...     with tempfile.NamedTemporaryFile() as f:
    ...         f.write(contents)
    ...         f.flush()
    ...         return detect_encoding(f.name)
    ...
    >>> test('# -*- coding: utf-8 -*-\n')
    'utf-8'
    >>> test('#!/bin/env python\n# -*- coding: latin-1 -*-\n')
    'iso-8859-1'
    >>> test('import this\n')
    'ascii'
    >>> import codecs
    >>> test(codecs.BOM_UTF8 + 'import this\n')
    'utf-8-sig'
    >>> test(codecs.BOM_UTF8 + '# encoding: latin-1\n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in test
      File "<string>", line 37, in detect_encoding
      File "<string>", line 24, in find_cookie
    SyntaxError: encoding problem for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpxsqH8L': utf-8
    >>> test('# encoding: foobarbaz\n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in test
      File "<string>", line 37, in detect_encoding
      File "<string>", line 18, in find_cookie
    SyntaxError: unknown encoding for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpHiHdG3': foobarbaz
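The detected encoding can then be passed straight to io.open() to decode the source; the 'utf-8-sig' codec transparently consumes the BOM on read. A minimal usage sketch (the file name is hypothetical):

    import io

    # Hypothetical file name; read the source decoded with the
    # encoding that detect_encoding() reported.
    encoding = detect_encoding('some_module.py')
    with io.open('some_module.py', encoding=encoding) as fileobj:
        source = fileobj.read()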