Why is Python's hashlib not strictly typed?

Python is supposed to be strongly typed.

For example, 'abc'['1'] will not work because you are expected to provide an integer there, not a string. An error is raised, and you can go on and correct it.

But this is not the case with hashlib. Try the following:

    import hashlib
    hashlib.md5('abc')   # Works OK
    hashlib.md5(1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: md5() argument 1 must be string or read-only buffer, not int
    hashlib.md5(u'abc')  # Works, but shouldn't: this is unicode, not str.
    hashlib.md5(u'é')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

Note that the last call fails not with a TypeError but with a UnicodeEncodeError, which is the exception raised when you try to encode a unicode object into a str.

I think I'm not too far from the truth when I assume that hashlib silently tried to convert the unicode object to a str.

Now, I agree that hashlib did say the argument to hashlib.md5() must be a string or a read-only buffer, which a unicode string is. But the behaviour above shows that in practice it is not: hashlib.md5() works properly with byte strings, and that's about it.

Of course, the main problem is that you get an exception with some unicode strings but not with others.

Which leads me to my questions. First, do you have an explanation why hashlib implements this behavior? Secondly, is this considered a problem? Thirdly, is there a way to fix this without changing the module itself?

hashlib is just an example; several other modules behave the same way when given unicode strings. That leads to an uncomfortable situation where your program works with ASCII input but fails completely with accented characters.

+4
2 answers

It's not just hashlib: Python 2 deals with unicode objects in a number of places by implicitly trying to encode them as ASCII. This was one of the big changes made for Python 3.
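The implicit step described here can be imitated explicitly. The sketch below is only an illustration: `try_ascii` is a hypothetical helper, not anything hashlib provides, and it performs by hand the ASCII encoding that Python 2 attempts behind the scenes. It shows why pure-ASCII text gets through while accented text does not.

```python
# Explicit imitation of the ASCII encoding that Python 2 attempts implicitly.
# try_ascii is a hypothetical helper written for this illustration.
def try_ascii(text):
    """Return the ASCII bytes of text, or None if it cannot be encoded."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        return None

plain = try_ascii(u'abc')       # pure ASCII: encoding succeeds
accented = try_ascii(u'\xe9')   # 'é' has no ASCII representation
```

Here `plain` holds the encoded bytes and `accented` is `None`, which mirrors the "works for some unicode strings, fails for others" behaviour from the question.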

In Python 3, strings are unicode, and they behave as you expect: there is no automatic conversion to bytes, and you need to encode them if you want to use bytes (for example, for MD5 hashing). I believe that there are hacks using sys.setdefaultencoding that allow this behavior in Python 2, but I would advise against using them in production, because they will affect any code running in this Python instance.

+12

This is a consequence of the Python 2.x C API, which makes it convenient to pass unicode objects to C functions that expect a byte string.

See the PyArg_ParseTuple* call in _hashopenssl.c.

It will try to encode the unicode object into a byte string when parsing it for the 's*' argument format. If the object cannot be encoded, an error is raised. The right thing to do is to always call .encode('utf-8'), or whatever other codec your application needs, before using unicode in a context where only a raw byte stream makes sense.
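A minimal sketch of that advice, assuming UTF-8 is the codec the application wants. `md5_hex` is a hypothetical wrapper name, and it expects a text (unicode) string rather than bytes:

```python
import hashlib

# md5_hex is a hypothetical helper; UTF-8 is an assumed default codec.
def md5_hex(text, encoding='utf-8'):
    """Encode text explicitly, then return the hex MD5 digest of the bytes."""
    return hashlib.md5(text.encode(encoding)).hexdigest()

digest_ascii = md5_hex(u'abc')      # ASCII-only text
digest_accent = md5_hex(u'\xe9')    # accented text no longer raises
```

Because the encoding is chosen explicitly, accented input is hashed the same way as ASCII input, and the digest does not depend on the interpreter's default encoding. Note that the same text encoded with a different codec produces a different digest, so the codec has to be agreed on wherever digests are compared.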

Python 3.x fixes this. There you will always get a friendly:

TypeError: Unicode objects must be encoded before hashing

instead of any automatic encoding.
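The Python 3 behaviour can be checked with a short sketch like the one below. It assumes a Python 3 interpreter, `rejects_text` is a hypothetical helper name, and the exact wording of the error message may differ between versions, so only the exception type is checked:

```python
import hashlib

# rejects_text is a hypothetical helper; assumes Python 3.
def rejects_text(value):
    """Return True if hashlib.md5 refuses value with a TypeError."""
    try:
        hashlib.md5(value)
    except TypeError:
        return True
    return False

text_rejected = rejects_text('abc')     # str is text on Python 3: refused
bytes_rejected = rejects_text(b'abc')   # bytes are accepted
```

On Python 3 every text string is refused, ASCII or not, so the failure no longer depends on the content of the input.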

+2
