I would borrow Python 3's tokenize.detect_encoding() function, slightly adjusted to fit Python 2. The signature is changed to accept a filename, and returning the lines read so far is dropped; you don't need them for this use case:
    import re
    from codecs import lookup, BOM_UTF8

    cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
    blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)')

    def _get_normal_name(orig_enc):
        """Imitates get_normal_name in tokenizer.c."""
        # Only the first 12 characters of the cookie value matter.
        enc = orig_enc[:12].lower().replace("_", "-")
        if enc == "utf-8" or enc.startswith("utf-8-"):
            return "utf-8"
        if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \
           enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")):
            return "iso-8859-1"
        return orig_enc
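The rest is detect_encoding() itself. The sketch below follows the body of Python 3's tokenize.detect_encoding(), adapted to take a filename and return only the encoding; find_cookie() is the inner helper that the demo tracebacks below refer to:

    def detect_encoding(filename):
        bom_found = False
        default = 'ascii'

        def find_cookie(line):
            # Look for a PEP 263 coding cookie on this line.
            match = cookie_re.match(line)
            if not match:
                return None
            encoding = _get_normal_name(match.group(1))
            try:
                lookup(encoding)
            except LookupError:
                # This behaviour mimics the Python interpreter
                raise SyntaxError(
                    "unknown encoding for {!r}: {}".format(filename, encoding))
            if bom_found:
                if encoding != 'utf-8':
                    # This behaviour mimics the Python interpreter
                    raise SyntaxError(
                        "encoding problem for {!r}: utf-8".format(filename))
                encoding += '-sig'
            return encoding

        with open(filename, 'rb') as fileobj:
            first = fileobj.readline()
            if first.startswith(BOM_UTF8):
                bom_found = True
                first = first[3:]
                default = 'utf-8-sig'
            if not first:
                return default

            encoding = find_cookie(first)
            if encoding:
                return encoding
            if not blank_re.match(first):
                # Only look at the second line if the first is
                # blank or a comment without a cookie.
                return default

            second = fileobj.readline()
            if not second:
                return default

            return find_cookie(second) or default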
Like the original, the function reads at most two lines from the source file, and it raises a SyntaxError if the coding cookie names an unknown encoding, or names an encoding other than UTF-8 while a UTF-8 BOM is present.
Demo:
    >>> import tempfile
    >>> def test(contents):
    ...     with tempfile.NamedTemporaryFile() as f:
    ...         f.write(contents)
    ...         f.flush()
    ...         return detect_encoding(f.name)
    ...
    >>> test('# -*- coding: utf-8 -*-\n')
    'utf-8'
    >>> test('#!/bin/env python\n# -*- coding: latin-1 -*-\n')
    'iso-8859-1'
    >>> test('import this\n')
    'ascii'
    >>> import codecs
    >>> test(codecs.BOM_UTF8 + 'import this\n')
    'utf-8-sig'
    >>> test(codecs.BOM_UTF8 + '# encoding: latin-1\n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in test
      File "<string>", line 37, in detect_encoding
      File "<string>", line 24, in find_cookie
    SyntaxError: encoding problem for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpxsqH8L': utf-8
    >>> test('# encoding: foobarbaz\n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in test
      File "<string>", line 37, in detect_encoding
      File "<string>", line 18, in find_cookie
    SyntaxError: unknown encoding for '/var/folders/w0/nl1bwj6163j2pvxswf84xcsjh2pc5g/T/tmpHiHdG3': foobarbaz
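The detected encoding can then be passed straight to io.open() to decode the source; the 'utf-8-sig' codec transparently consumes the BOM on read. A minimal usage sketch (the file name is hypothetical):

    import io

    # Hypothetical file name; read the source decoded with the
    # encoding that detect_encoding() reported.
    encoding = detect_encoding('some_module.py')
    with io.open('some_module.py', encoding=encoding) as fileobj:
        source = fileobj.read()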