Better to use a dict with integer keys, or a very long list?

Basically, I need to create a lookup table keyed by immutable integer identifiers. In terms of lookup speed, am I generally better off using a dict with integer keys, or a very long list with lots of empty indices? It seems to me that a list could even be faster, since Python knows exactly where to look, but I wonder whether the dict has internal optimizations that compensate, and whether the extra memory needed for all those empty list slots would cancel out the (presumably) faster list indexing. Are there alternatives to list and dict that might be better suited for this?

I saw this question, but it does not completely answer mine: Comparing dictionary access speed with integer keys vs. string keys

ETA: I implement lookup tables like this twice in my program. One instance has a maximum ID of 5000 with 70-100 entries populated; the other has a maximum ID of 750 with 20-30 populated.

+7
python
3 answers

To answer the dict vs. list question properly, you would need to give precise information about the number of elements, the number of missing elements, etc., so that we could estimate the memory usage of the two data structures accurately and try to predict and/or verify their performance.

In general, a dict with N key-value pairs requires a bit more memory than a list with N values:

  • a dict must store the keys as well as the values
  • a dict never gets more than 2/3 full; when it does, the allocated memory is doubled (this is required for dict operations to be amortized O(1))
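As a rough illustration of that resizing behaviour (CPython's exact growth policy varies between versions, so the numbers printed are indicative, not guaranteed), you can watch a dict's allocated size jump as it fills past its load factor:

```python
import sys

d = {}
last_size = sys.getsizeof(d)
print(f"empty dict: {last_size} bytes")

# Insert integer keys and report every time the allocation grows.
for i in range(100):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last_size:
        print(f"after {len(d):3d} items: {size} bytes")
        last_size = size
```

The size stays flat between resizes and then jumps, which is exactly the amortized-doubling behaviour described above.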

However, there is an alternative data structure that should give you very good performance: blist . The blist package provides a type that matches the list interface, but is implemented using B+ trees. It can handle sparse lists efficiently. Most operations take O(1) or O(log n) time, so they are quite fast.

For example, you can first create a sparse blist like this:

    from blist import blist
    seq = blist([None])
    seq *= 2**30  # create a 2**30 element blist. Instantaneous!

Then you can set only the indices that matter:

    for i, value in zip(indices, values):
        seq[i] = value

The full documentation is here.

Note that blist provides other efficient operations, such as:

  • Concatenating two blists takes O(log n) time
  • Taking a [i:j] slice takes O(log n) time
  • Inserting an element at a given index takes O(log n) operations
  • Popping an element (from any position) takes O(log n) operations

Since you gave some numbers, here is how they compare with dicts:

    >>> from blist import blist
    >>> b = blist([None])
    >>> b *= 5000
    >>> for i in range(100): b[i] = i
    ...
    >>> b.__sizeof__()
    2660
    >>> d = dict()
    >>> for i in range(100): d[i] = i
    ...
    >>> d.__sizeof__()
    6216
    >>> b = blist([None])
    >>> b *= 750
    >>> for i in range(30): b[i] = i
    ...
    >>> b.__sizeof__()
    1580
    >>> d = dict()
    >>> for i in range(30): d[i] = i
    ...
    >>> d.__sizeof__()
    1608

In both cases, the blist takes less memory (in the first case, about 1/3 of the memory of the equivalent dict). Note that the memory taken by a blist also depends on whether the indices are contiguous (contiguous is better). However, even using random indices:

    >>> b = blist([None])
    >>> b *= 5000
    >>> import random
    >>> for i in range(100): b[random.randint(0, 4999)] = i
    ...
    >>> b.__sizeof__()
    2916

it is still much better than a dict.

Lookup times are better too:

    In [1]: from blist import blist
       ...: import random
       ...:
    In [2]: b = blist([None])
    In [3]: b *= 5000
    In [4]: for i in range(100): b[random.randint(0, 4999)] = i
    In [5]: %timeit b[0]
    10000000 loops, best of 3: 50.7 ns per loop
    In [6]: d = dict()
    In [7]: for i in range(100): d[random.randint(0, 4999)] = i
    In [10]: %timeit d[1024]  # 1024 is an existing key in this dictionary
    10000000 loops, best of 3: 70.7 ns per loop
    In [11]: %timeit b[1024]
    10000000 loops, best of 3: 50.7 ns per loop

Note that a plain list takes about 47 ns to look up an index on my machine, so blist is really very close to list in lookup performance for small lists like yours.
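If you do not want to install blist, you can reproduce a similar list vs. dict lookup comparison with only the standard library. This sketch mirrors the question's first case (5000 slots, 100 populated); absolute times depend entirely on the machine, so no ordering is asserted:

```python
import timeit

# Sparse list: 5000 slots, 100 populated (the question's first case).
lst = [None] * 5000
d = {}
for i in range(0, 5000, 50):
    lst[i] = i
    d[i] = i

# Time a lookup on an existing index/key in each structure.
list_t = timeit.timeit("lst[1000]", globals={"lst": lst}, number=100_000)
dict_t = timeit.timeit("d[1000]", globals={"d": d}, number=100_000)
print(f"list lookup: {list_t:.4f}s  dict lookup: {dict_t:.4f}s")
```

On most CPython builds both are tens of nanoseconds per lookup; at this scale the memory trade-off matters more than the speed difference.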

+8

Lists:
1. append and pop at the end of the list are fast
2. insert and pop at the beginning of the list are slow (there is a heavy shift operation behind these two functions)
3. it is better to use collections.deque for the second case.
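To see point 2 in action, compare popping from the front of a list with popping from a collections.deque. Exact timings are machine-dependent; the point is that list.pop(0) shifts every remaining element while deque.popleft() is O(1):

```python
import timeit
from collections import deque

setup_list = "data = list(range(10_000))"
setup_deque = "from collections import deque\ndata = deque(range(10_000))"

# pop(0) on a list shifts all remaining elements; popleft() on a deque does not.
t_list = timeit.timeit("data.pop(0)", setup=setup_list, number=5_000)
t_deque = timeit.timeit("data.popleft()", setup=setup_deque, number=5_000)
print(f"list.pop(0): {t_list:.4f}s  deque.popleft(): {t_deque:.4f}s")
```

The gap widens as the container grows, which is why deque is the right tool when you work at both ends.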

Dictionaries:
4. access by key is faster than searching a list



Iterating over dictionaries and lists:

  • Dictionaries use the iteritems() method to get a key and its corresponding value at the same time.
  • Lists use enumerate() for the same purpose.

    Notes:
  • If your concern is loop speed, there is no tangible difference between iteritems() and enumerate()
  • iteritems() was removed in Python 3.x; use items() instead.
  • zip() is a heavy operation; avoid it.
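A short sketch of both iteration styles in Python 3, where items() replaces the removed iteritems():

```python
d = {"a": 1, "b": 2}
lst = ["x", "y"]

# Dict: key and value together (Python 3 items(); Python 2 had iteritems()).
for key, value in d.items():
    print(key, value)

# List: index and element together.
for index, element in enumerate(lst):
    print(index, element)
```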
+1

I think there is no general answer to this question. It depends on the distribution of your integer identifiers, the available memory, and your performance requirements. Rules:

  • list lookups are faster because you do not need to compute a hash of the key.
  • a dict may be more compact if the largest key value is large.
  • if your largest key is very large (around 2^30), a list will waste a lot of memory and the system will start swapping, which degrades performance significantly.
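A quick stdlib-only check of that memory trade-off for the question's first case. Note that sys.getsizeof measures only the container's own overhead, not the stored values, and exact numbers vary by Python version:

```python
import sys

# 5000 slots, 100 populated, as in the question's first lookup table.
sparse_list = [None] * 5000
d = {}
for i in range(100):
    sparse_list[i * 50] = i
    d[i * 50] = i

print(f"list (5000 slots): {sys.getsizeof(sparse_list)} bytes")
print(f"dict (100 keys):   {sys.getsizeof(d)} bytes")
```

At 2% occupancy the dict wins on memory by a wide margin; a densely populated list would tip the balance the other way.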

A possible rule of thumb:

  • if there are few "empty" values, or if you know the largest key will be "reasonably" low (relative to the memory you are willing to spend on it) => use a list
  • if the previous condition does not hold and you have no strong performance requirement => use a dict
  • if neither of the two preceding conditions holds, you will have to try some hash function optimizations, detailed below

In theory, a dict is an array whose index is the result of a hash function applied to the key. Python's hash algorithm is well optimized, but it is generic. If you know your keys have a special distribution, you can try to find a hash function specially adapted to it. You can find pointers to go further in the Wikipedia article on hash functions, or in the good old C standard library hash.
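A hedged sketch of that idea: if you knew, say, that all your IDs shared a common stride, a hash that divides the stride out spreads them evenly over a small table. The SparseTable class, its parameters, and the stride assumption below are invented for illustration; this is not how Python's dict is implemented:

```python
class SparseTable:
    """Minimal open-addressing table with a key-specific hash (illustrative only)."""

    def __init__(self, size, stride):
        self.size = size        # number of slots; keep occupancy well below 2/3
        self.stride = stride    # known common factor of all keys (assumption)
        self.slots = [None] * size

    def _index(self, key):
        # Specialized hash: dividing out the stride avoids the clustering
        # that (key % size) alone can produce for strided keys.
        return (key // self.stride) % self.size

    def __setitem__(self, key, value):
        i = self._index(key)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size  # linear probing on collision
        self.slots[i] = (key, value)

    def __getitem__(self, key):
        i = self._index(key)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.size
        raise KeyError(key)

t = SparseTable(size=256, stride=50)
for k in range(0, 5000, 50):   # 100 strided keys, as in the question
    t[k] = k * 2
print(t[4950])  # -> 9900
```

With 100 strided keys in 256 slots, every key lands in its own slot, so lookups never probe; a generic hash cannot exploit that kind of structure.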

+1
