Hash generator expression

Apparently, python will allow me to hash a generator expression like (i for i in [1, 2, 3, 4, 5])

 >>> hash(i for i in [1, 2, 3, 4, 5]) 8735741846615 

However, upon closer inspection, this hash value is always the same no matter which generator I put in it!

 >>> hash(i for i in range(2)) 8735741846615 >>> hash(i for i in [1, 2, 3, 4, 5, 6]) 8735741846615 >>> hash(i for i in [0, 1, 2, 3, 4, 5, 6]) 8735741846615 

Why is this happening? Why is generator hashing even allowed?

I need to do this because I store stuff in the dictionary cache. You access cache entries by passing lists of objects. But certain groups of lists with different elements must still point to the same record, since the only thing that matters here is the integer value of one attribute in each element of the list. Therefore, they should use the hash only on the basis of these integer values, and not on the elements themselves, in order to avoid unnecessary misses in the cache.

I know that you can always simply convert an expression to a tuple, but I asked if this could be bypassed simply by using the generator output without a tuple container, similar to how sum() can be used for such a thing.

+5
source share
4 answers

So there are actually two questions:

  • Why do generators have a hash when, for example, there are no lists?
  • Why do you always get the same hash?

For 2, the answer is simple: in this case, the hash is based on the id object. Since you are not actually storing the object, its memory is reused. This means that the next generator has the same id and therefore hash.

For 1, the answer is "because they can." hash is primarily intended for use in dict , set and other situations where it allows you to identify an object. These situations impose a restriction that a == b also implies hash(a) == hash(b) (the opposite is not limited).

Now, for list , dict and other collections, equality is content-based. [1,2,3] == [1,2,3] regardless of whether both of them are the same objects, for example. This means that if something is added to them, their equality will change and therefore their hash will change. Thus, hash is undefined, as it must be constant to work in dict , etc.

In contrast, a generator can have any content. Consider, for example, a generator that provides random values. Thus, it makes no sense to compare generators by content. They are compared only by identity. So a == b equals id(a) == id(b) for generators. In turn, this means that the underlying hash(a) on id(a) will always satisfy the equality constraint on hash .

+5
source

The hash value is apparently based on the identifier of the object, and not on its contents. You probably see this result because you do not store generators, so they are garbage collected and their identifiers are reused. Here are a few other examples that show this is not just one hash:

 >>> x = (i for i in [0, 1, 2, 3, 4, 5, 6]) >>> y = (i for i in [0, 1, 2, 3, 4, 5, 6]) >>> hash(x) -9223372036852064692 >>> hash(y) -9223372036852064683 >>> id(x) 43377864 >>> x = (i for i in [0, 1, 2, 3, 4, 5, 6]) >>> id(x) 43378296 >>> hash(x) -9223372036852064665 

As for why Python allows you to use them, it also allows you to hash all kinds of objects for the same reason: so you can separate different objects and use them as dictionary keys. This does not mean that you will talk about what the generator does, but which object. This is no different from this:

 >>> hash(object()) 225805 >>> hash(object()) 225805 >>> x = object() >>> y = object() >>> hash(x) 225805 >>> hash(y) 225806 

The idea that if an object is hashable, then it is immutable, is something of a misunderstanding. Objects can define any hash they want if it is compatible with their own definition of equality. For example, instances of custom classes are hashed in a manner similar to these generator functions (a hash based on an object identifier).

People sometimes seem to think that the rule "if an object is modified, it is not hashed," but this is not true. The actual rule is more like "if the object is not hashed, then it is modified." (More specifically, it defines the concept of equality, which depends on the state being changed, or it explicitly prohibits hashing for some perverse reason.) In other words, it usually doesn't make sense to ask why a particular hashed type is hashed; it makes sense to ask why the split type is unfastened. The built-in list of mutable types, dict and set have a specific reason for unlocking: they are intended for comparison based on their (mutable) values, and not on the identifier of the object. But just because some other type has some kind of internal state does not mean that it cannot be hashed. You can define objects that change and hashed throughout the day. They just have to define equality and hashability in compatible ways. Generators really cannot be compared for equality based on their contents:

 >>> x = (i for i in [0, 1, 2, 3, 4, 5, 6]) >>> y = (i for i in [0, 1, 2, 3, 4, 5, 6]) >>> x == y False 

For generators, value-based hashing would be unambiguously pointless because values ​​do not exist until they are actually obtained from the generator. If the hashing was based on values, there would be no way to hash a generator that, for example, would use random numbers to decide what comes next.

+4
source

Python object resolves hash and equality: hash(object()) .

The "irreversible type" is a targeted restriction on the main mutable collections (i.e. list ) with a certain comparison compared to the contents, as this avoids some common programming errors. In particular, when used as dictionary keys.

However, comparing the contents with the equivalent for the generator object is the same as comparing the identity: gen == gen implies gen is gen . This is no different than using hash/== with the result of object() : this is often not useful, and if so, ish-identity / equality is expected.

The result is either a hash or generator equality does not depend on future materialized content. Because of this, it does not have content-useful == and, therefore, it should not implement content-useful hash . If it should be a "disappointing type", then at best it becomes a gray area, because it is no longer used (or rather how) more useful than object() as a dictionary key.

Thus, it is acceptable that any value be returned for the hash of the generator (even if it is duplicated between different instances of the generator, if it meets the restriction hash(x) == hash(x) β†’ x == x ), if it is consistent for this instance of the generator (as gen == gen β†’ gen is gen ).

As an implementation detail, I would expect the generator hash to use either the same (as in, but not overloaded), or a similar implementation with the base object.

0
source

generator expression

An expression that returns an iterator

If iterators are hashed, generator expressions should not be an exception. Iterators, in terms of state, are not sure about variability. Or can it mean restraint? Moreover, iterators are just factory objects / values ​​and nothing more:

 >>> l = [1, 2] >>> it = iter(l) >>> l[:] = [4, 5] >>> next(it) 4 

The iterator is not associated with the initial values ​​of the list l . Thus, generator expressions are not hashed with respect to values ​​in range(2) or [1, 2, 3, 4, 5, 6] .

Hashing with the same value seems to be dependent on the interpreter, and probably this is because you have no reference to generator expressions. I have different results for each:

 >>> hash(i for i in range(2)) -2143682330 >>> hash(i for i in range(2)) -2143687189 >>> hash(i for i in [1, 2, 3, 4, 5, 6]) -2143679383 

If you named generator expressions, you will probably get different values:

 >>> ge1 = (i for i in range(2)) >>> ge2 = (i for i in [1, 2, 3, 4, 5, 6]) >>> hash(ge1) 3801919 >>> hash(ge2) -2143681547 
0
source

All Articles