Why does dict have worst case O(n) for so many operations?

How exactly is dict implemented? Does it use a linear-time search to resolve collisions? I assume it is implemented as a hash table backed by a list. I would have expected the best implementation to be O(log n) for the various operations, using a tree instead of a hash table. Is there some magic going on behind the scenes to keep lookups as close to constant time as possible?

My source for this, by the way, is the following:

http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=python+complexity

+5
5 answers

A dict is O(1) for most operations, except for operations that touch every element, such as iteration and copying (which are obviously O(n)).

See: http://wiki.python.org/moin/TimeComplexity

The worst case is O(n), because you can always construct a pathological example where all keys have the same hash value.
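As a minimal sketch of such a pathological case (the class name `BadHash` is hypothetical), a key type whose `__hash__` always returns the same value forces every key into one bucket, so each lookup degrades to a linear probe:

```python
class BadHash:
    """Pathological key type: every instance hashes to the same bucket."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42  # constant hash: all keys collide

    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value


# Every insertion and lookup must now compare against all previously
# colliding keys, so operations degrade from O(1) toward O(n).
d = {BadHash(i): i for i in range(1000)}
assert d[BadHash(500)] == 500  # correct, but found by linear probing
```

The dict still gives correct answers; only the performance degenerates.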

+9

Choosing one implementation over another is not necessarily about the upper bound, but rather about the expected amortized performance. While a given algorithm may have degenerate cases, it is usually "better in practice" than an approach with a provably better worst-case bound. In some cases, however, structures must be designed to guard against pathologically bad input.

In addition, some languages/libraries (I am not sure about Python) actually change the underlying implementation, for example when the number of elements exceeds some small n. This affects amortized performance (in some cases), but not necessarily the big-O.

And in conclusion: "It depends."

Happy coding.

+2

Consider even the best hash function in the galaxy. There is still a chance that you may someday come up with a list of values that all hash to the same value. If you put them in a dict, the system has no choice but to perform linear searches.

Using a balanced tree would cap the worst case at O(log n), but the maintenance costs are quite high. In practice, hash tables usually work quite well.

+1

I would have expected the best implementation to be O(log n) for various operations, using a tree instead of a hash table.

Trees and hash tables have very different requirements and performance characteristics.

  • Trees require an ordered type.
  • Trees need to do order comparisons to locate an object. For some objects, such as strings, this prevents significant optimizations: you always need to do a string comparison, which is non-trivially expensive. This makes the constant factor on O(log n) quite high.
  • Hash tables require a hashable type that can be checked for equality, but they do not require an ordered type.
  • Equality testing can be heavily optimized. If two strings are interned, you can check whether they are equal in O(1) by comparing their pointers, rather than in O(n) by comparing the whole string. This is a massive optimization: in every attribute lookup foo.bar , which translates to foo.__dict__["bar"] , "bar" is an interned string.
  • Hash tables are O(n) in the worst case, but consider what leads to that worst case: a very bad hash table (for example, one with only a single bucket) or a broken hash function that always returns the same value. With a proper hash function and a proper bucketing algorithm, lookups are very cheap, very often approaching constant time.
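The interning point above can be illustrated with `sys.intern` (a small sketch; the literal `"attr_name"` is just an example string): interning equal strings guarantees they are one object, so equality can short-circuit to a pointer comparison.

```python
import sys

# Build the same string two different ways, interning both copies.
s1 = sys.intern("".join(["attr", "_name"]))
s2 = sys.intern("attr_name")

# Interned equal strings are the same object, so the interpreter's
# equality check can succeed with a single pointer comparison.
assert s1 is s2

# A non-interned copy is still equal, but comparing it may require
# an O(n) character-by-character scan.
s3 = "".join(["attr", "_name"])
assert s3 == s2
```

This is exactly why CPython interns attribute names: `__dict__` lookups hit the fast pointer-comparison path almost every time.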

Trees have significant advantages:

  • They tend to have lower memory requirements, since they do not have to pre-allocate buckets. The smallest tree node can be 12 bytes (a value pointer and two child pointers), whereas a hash table tends to be 128 bytes or more; sys.getsizeof({}) on my system is 136.
  • They allow ordered traversal; being able to iterate over the range [a, b) of an ordered set is extremely useful, and hash tables do not allow it.
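Python's standard library has no balanced-tree container, but a sorted list with the `bisect` module sketches the ordered range query [a, b) that trees support and hash tables do not (a minimal illustration of the idea, not how dict works; `range_query` is a made-up helper):

```python
import bisect

# Stand-in for a tree's keys kept in sorted order.
keys = sorted([5, 1, 9, 3, 7])


def range_query(sorted_keys, a, b):
    """Return all keys k with a <= k < b, in order: O(log n) to locate
    the bounds, plus O(m) to emit the m matching keys."""
    lo = bisect.bisect_left(sorted_keys, a)
    hi = bisect.bisect_left(sorted_keys, b)
    return sorted_keys[lo:hi]


print(range_query(keys, 3, 9))  # [3, 5, 7]
```

A hash table scatters keys by hash value, so the only way to answer the same query is to scan all n keys.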

I consider it a drawback that Python does not have a standard binary-tree container, but for the performance characteristics needed by the Python core, such as __dict__ lookups, a hash table makes more sense.

+1

Authoritative sources on the hashing and collision-resolution strategies actually used are the comments in the dictobject.c source file and the entire dictnotes.txt file.

0
