Tracking implicit Unicode conversions in Python 2

I have a large project where problematic implicit Unicode conversions were used in various places, in forms such as:

    someDynamicStr = "bar"  # could come from various sources

    # works
    u"foo" + someDynamicStr
    u"foo{}".format(someDynamicStr)

    someDynamicStr = "\xff"  # uh-oh

    # raises UnicodeDecodeError
    u"foo" + someDynamicStr
    u"foo{}".format(someDynamicStr)

(There are perhaps other forms as well.)

Now I would like to track down these usages, especially those in heavily exercised code paths.

It would be great if I could simply replace the unicode constructor with a wrapper that checks whether the input is of type str and whether the encoding/errors parameters are left at their defaults, and then notifies me (prints a stack trace or the like).

/edit:

Not directly related to what I'm looking for, but I came across this gloriously horrible hack for sidestepping decoding exceptions (only decoding, i.e. str to unicode, but not the other way around; see https://mail.python.org/pipermail/python-list/2012-July/627506.html).

I do not plan to use it, but it may be interesting for anyone struggling with invalid Unicode input who is looking for a quick fix (but please think about the side effects):

    import codecs
    codecs.register_error("strict", codecs.ignore_errors)
    # alternatively:
    codecs.register_error("strict", lambda x: (u"", x.end))

(A web search for codecs.register_error("strict" shows that it is apparently used in some real-world projects.)
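
To make the effect concrete, here is a minimal, untested sketch of what the hack does, per the linked thread (note the global side effect: it silently changes the behavior of every codec lookup of "strict" in the whole process):

    import codecs

    # WARNING: globally replaces the "strict" error handler for the whole
    # process; every encode/decode relying on it is silently affected.
    codecs.register_error("strict", lambda exc: (u"", exc.end))

    someDynamicStr = "\xff"
    print(repr(u"foo" + someDynamicStr))  # prints u'foo' instead of raising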

/edit 2:

For explicit conversions, I made a snippet with the help of an SO post on monkeypatching:

    class PatchedUnicode(unicode):
        def __init__(self, obj=None, encoding=None, *args, **kwargs):
            if encoding in (None, "ascii", "646", "us-ascii"):
                print("Problematic unicode() usage detected!")
            super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

    import __builtin__
    __builtin__.unicode = PatchedUnicode

This only affects explicit conversions using the unicode() constructor, so this is not what I need.
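To illustrate the limitation (a sketch, assuming the monkeypatch above is active):

    unicode("bar")  # explicit: goes through PatchedUnicode, prints the message
    u"foo" + "bar"  # implicit: handled by the C-level codec machinery, which
                    # never calls __builtin__.unicode, so nothing is printed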

/edit 3:

The thread "Extension method for python built-in types!" makes me think it might not even be easily possible (at least in CPython).

/edit 4:

It's nice to see a lot of good answers here; too bad the bounty can only be awarded once.

In the meantime, I came across a somewhat similar question, at least in terms of what the person was trying to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs? Note, however, that throwing an exception would not have been OK in my case. Here I was looking for something that could point me to the various locations of problematic code (for example, by printing something), but not something that would exit the program or change its behavior (because this way I can prioritize what to fix).

On the other hand, the people working on the Mypy project (including Guido van Rossum) may also come up with something similarly useful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208 .

/edit 5:

I also stumbled upon the following, but have not had time to check it out yet: https://pypi.python.org/pypi/unicode-nazi
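
(Untested, so take this with a grain of salt: judging by the package description, importing the module should be enough to emit a UnicodeWarning on every implicit conversion, which the warnings machinery can then surface:)

    # Untested sketch, based only on the package description:
    import warnings
    warnings.simplefilter("always", UnicodeWarning)  # report every occurrence
    import unicodenazi  # patches the runtime to warn on implicit conversions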

4 answers

You can register a custom encoding that prints a message when it is in use:

Code in ourencoding.py:

    import sys
    import codecs
    import traceback

    # Define a function to print out a stack frame and a message:

    def printWarning(s):
        sys.stderr.write(s)
        sys.stderr.write("\n")
        l = traceback.extract_stack()
        # cut off the frames pointing to printWarning and our_encode
        l = traceback.format_list(l[:-2])
        sys.stderr.write("".join(l))

    # Define our encoding:

    originalencoding = sys.getdefaultencoding()

    def our_encode(s, errors='strict'):
        printWarning("Default encoding used")
        return (codecs.encode(s, originalencoding, errors), len(s))

    def our_decode(s, errors='strict'):
        printWarning("Default encoding used")
        return (codecs.decode(s, originalencoding, errors), len(s))

    def our_search(name):
        if name == 'our_encoding':
            return codecs.CodecInfo(
                name='our_encoding',
                encode=our_encode,
                decode=our_decode)
        return None

    # register our search and set the default encoding:
    codecs.register(our_search)
    reload(sys)
    sys.setdefaultencoding('our_encoding')

If you import this file at the top of your script, you will see warnings for implicit conversions:

    #!python2
    # coding: utf-8

    import ourencoding

    print("test 1")
    a = "hello " + u"world"

    print("test 2")
    a = "hello ☺ " + u"world"

    print("test 3")
    b = u" ".join(["hello", u"☺"])

    print("test 4")
    c = unicode("hello ☺")

Output:

    test 1
    test 2
    Default encoding used
      File "test.py", line 10, in <module>
        a = "hello ☺ " + u"world"
    test 3
    Default encoding used
      File "test.py", line 13, in <module>
        b = u" ".join(["hello", u"☺"])
    test 4
    Default encoding used
      File "test.py", line 16, in <module>
        c = unicode("hello ☺")

It is not perfect: as test 1 shows, if the converted string contains only ASCII characters you sometimes won't see a warning (presumably because CPython constant-folds the concatenation of two literals at compile time, before the custom default encoding is installed).


What you can do is the following:

First, create a custom encoding. I will call it "lascii", for "logging ASCII":

    import codecs
    import traceback

    def lascii_encode(input, errors='strict'):
        print("ENCODED:")
        traceback.print_stack()
        return codecs.ascii_encode(input)

    def lascii_decode(input, errors='strict'):
        print("DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input)

    class Codec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return lascii_encode(input, errors)
        def decode(self, input, errors='strict'):
            return lascii_decode(input, errors)

    class IncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            print("Incremental ENCODED:")
            traceback.print_stack()
            return codecs.ascii_encode(input)

    class IncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            print("Incremental DECODED:")
            traceback.print_stack()
            return codecs.ascii_decode(input)

    class StreamWriter(Codec, codecs.StreamWriter):
        pass

    class StreamReader(Codec, codecs.StreamReader):
        pass

    def getregentry():
        return codecs.CodecInfo(
            name='lascii',
            encode=lascii_encode,
            decode=lascii_decode,
            incrementalencoder=IncrementalEncoder,
            incrementaldecoder=IncrementalDecoder,
            streamwriter=StreamWriter,
            streamreader=StreamReader,
        )

This behaves basically like the ASCII codec, except that it additionally prints a message and the current stack trace every time it converts between str and unicode.

Now you need to make it available to the codecs module so that it can be found by the name "lascii". To do this, create a search function that returns the lascii codec when it is fed the string "lascii", and register it with the codecs module:

    def searchFunc(name):
        if name == "lascii":
            return getregentry()
        else:
            return None

    codecs.register(searchFunc)

The last thing left to do is tell the sys module to use "lascii" as the default encoding:

    import sys
    reload(sys)  # necessary, because sys.setdefaultencoding is deleted at Python startup
    sys.setdefaultencoding('lascii')
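
As a quick check, assuming the three snippets above have been run in that order, any runtime mix of str and unicode should now log a stack trace (a sketch, not part of the original setup; note that literal-only expressions can be constant-folded at compile time and escape detection):

    s = "abc"        # plain byte string held in a variable, so the addition
    u = u"def"       # below cannot be constant-folded away at compile time
    result = s + u   # implicit decode of s: prints "DECODED:" and a stack trace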

Caution: this uses some deprecated or otherwise discouraged features; it may be inefficient or buggy. Do not use it in production; use it only for testing and/or debugging.


Just add:

 from __future__ import unicode_literals 

at the top of your source code files; it should be the first import, and it has to go into every affected source file. Then the headache of handling Unicode in Python 2.7 goes away: if you haven't done anything super weird with strings, this should get rid of the problem out of the box.
See the following copy & paste from my console; I tried it with the sample from your question:

    user@linux2:~$ python
    Python 2.7.6 (default, Jun 22 2015, 17:58:13)
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> someDynamicStr = "bar" # could come from various sources
    >>>
    >>> # works
    ... u"foo" + someDynamicStr
    u'foobar'
    >>> u"foo{}".format(someDynamicStr)
    u'foobar'
    >>>
    >>> someDynamicStr = "\xff" # uh-oh
    >>>
    >>> # raises UnicodeDecodeError
    ... u"foo" + someDynamicStr
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
    >>> u"foo{}".format(someDynamicStr)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
    >>>

And now with __future__ magic:

    user@linux2:~$ python
    Python 2.7.6 (default, Jun 22 2015, 17:58:13)
    [GCC 4.8.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from __future__ import unicode_literals
    >>> someDynamicStr = "bar" # could come from various sources
    >>>
    >>> # works
    ... u"foo" + someDynamicStr
    u'foobar'
    >>> u"foo{}".format(someDynamicStr)
    u'foobar'
    >>>
    >>> someDynamicStr = "\xff" # uh-oh
    >>>
    >>> # raises UnicodeDecodeError
    ... u"foo" + someDynamicStr
    u'foo\xff'
    >>> u"foo{}".format(someDynamicStr)
    u'foo\xff'
    >>>

I see that you have a lot of edits describing the solutions you have already come across. I'm just going to address your original post, which I take to be: "I want to create a wrapper around the unicode constructor that checks the input."

The unicode constructor is a Python built-in. You can decorate it to add your checks to every call:

    def add_checks(fxn):
        def resulting_fxn(*args, **kargs):
            # this is where you check whether the input is of type str
            if type(args[0]) is str:
                # do something
                pass
            # this is where the encoding/errors parameters are set to the default values
            encoding = 'utf-8'
            # Set default error behavior
            error = 'ignore'
            # Print any information (i.e. a traceback)
            # print 'blah'
            # TODO: for a traceback, you'll want to use the pdb module
            return fxn(args[0], encoding, error)
        return resulting_fxn

Using it looks like this:

 unicode = add_checks(unicode) 

We overwrite the existing function name so that you don't have to change all the call sites throughout the large project. You want to do this very early at runtime so that all subsequent calls get the new behavior.
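
Since the question asks for something that only reports and never changes behavior, a more conservative variant of the same decorator idea could look like the following sketch (the checks are illustrative, and like the original it only catches explicit unicode() calls, not implicit conversions):

    import traceback

    def add_checks(fxn):
        def resulting_fxn(*args, **kwargs):
            # Report str inputs that rely on the implicit default encoding,
            # then delegate unchanged so program behavior is unaffected.
            if args and type(args[0]) is str and len(args) == 1 and 'encoding' not in kwargs:
                traceback.print_stack()
            return fxn(*args, **kwargs)
        return resulting_fxn

    unicode = add_checks(unicode)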

