Efficiency of long (str) keys in a Python dictionary

I am parsing some XML (with some Python 3.4 code) and want to get both the text of a node and its id attribute. Example: <li id="12345"> Some text here </li>. My current code is structured around the text only (I am adding an identifier now, but did not need one before): I loop over a list of the texts/sentences and go from there. So I thought about building a dictionary with the text/sentence as the key and the id attribute as the value.

However, this does not feel very efficient. The text can be a whole paragraph, which makes the key very long, whereas the id always has a fairly limited length (though it is still of type str, e.g. a few alpha characters followed by some digits). Making the id the key and the text the value, on the other hand, would require some rewriting of my code. None of this is really a problem, but it got me thinking: how inefficient is it to have a text (possibly a whole paragraph) as a key, compared to an id like "ulp_887362487687678"?
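To make it concrete, roughly the two layouts I am weighing (the file name and parsing details are just a sketch):

    import xml.etree.ElementTree as ET

    # Sketch only: "sentences.xml" is a placeholder for my real input file.
    root = ET.parse('sentences.xml').getroot()

    text_keyed = {}   # whole sentence/paragraph as the key
    id_keyed = {}     # short id string as the key

    for li in root.iter('li'):
        text = (li.text or '').strip()
        ident = li.get('id')
        text_keyed[text] = ident
        id_keyed[ident] = text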

I could simply build the two reverse dictionaries (one with the id as key, the other with the text as key), time construction and lookup, and be done with it. I also found some posts on key length restrictions (Do dictionaries have a key length restriction?). But I am just wondering what you think about it. Is a key that long something you would definitely want to avoid in your dicts, or is it not a big deal? If you could share some pros/cons, that would be great!

+6
3 answers

No, Python string length hardly has any impact on dictionary performance. The only influence the length of the string could have is on the hash() function that is used to map the key to a hash table slot.

String length has very little impact on the performance of hash():

    >>> import random
    >>> from timeit import timeit
    >>> from string import ascii_letters
    >>> generate_text = lambda len: ''.join([random.choice(ascii_letters) for _ in xrange(len)])
    >>> for i in range(8):
    ...     length = 10 + 10 ** i
    ...     testword = generate_text(length)
    ...     timing = timeit('hash(t)', 'from __main__ import testword as t')
    ...     print 'Length: {}, timing: {}'.format(length, timing)
    ...
    Length: 11, timing: 0.061537027359
    Length: 20, timing: 0.0796310901642
    Length: 110, timing: 0.0631730556488
    Length: 1010, timing: 0.0606122016907
    Length: 10010, timing: 0.0613977909088
    Length: 100010, timing: 0.0607581138611
    Length: 1000010, timing: 0.0672461986542
    Length: 10000010, timing: 0.080118894577

I stopped at generating a string of 10 million characters, because I did not feel like waiting for my laptop to generate a 100 million character string.

The timings are pretty much constant, because once computed the hash value is actually cached on the string object.
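If you want to see the effect on actual dictionary lookups rather than on hash() alone, here is a minimal sketch along the same lines (Python 3 this time; the exact numbers will of course vary per machine):

    from timeit import timeit

    # Minimal sketch: dictionary lookup time barely depends on key length,
    # because the key's hash is computed once and then cached.
    for length in (10, 1000, 100000, 10000000):
        key = 'x' * length
        d = {key: 1}
        timing = timeit('d[key]', 'from __main__ import d, key', number=1000000)
        print('Key length: {:9d}, lookup timing: {}'.format(length, timing))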

+11

hash() performance is indeed O(n) for strings, but the result is cached on the string - repeated calls will use the cached value. This is possible because strings are immutable. Martijn's code uses the repeating feature of timeit, so you cannot see this effect, because in that case the hash is only actually computed on the very first of the repeated runs.
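A minimal sketch of the caching effect itself: the second call reuses the value stored on the string object, so it is practically free:

    from timeit import timeit

    # The first call has to hash all 10 million characters;
    # the second call reuses the hash cached on the same string object.
    s = 'a' * 10000000
    first = timeit('hash(s)', 'from __main__ import s', number=1)
    second = timeit('hash(s)', 'from __main__ import s', number=1)
    print('first : {:.8f}'.format(first))
    print('second: {:.8f}'.format(second))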

It is still O(n) when the hash does have to be computed, as shown below:

    import random
    from timeit import timeit

    for i in range(10):
        length = 10 ** i
        # notice number=1 !!!
        timing = timeit('hash(t)', 't = "a" * {}'.format(length), number=1)
        print('Length: {:10d}, timing: {:.20f}'.format(length, timing))

    Length:          1, timing: 0.00000437500057159923
    Length:         10, timing: 0.00000287900184048340
    Length:        100, timing: 0.00000342299972544424
    Length:       1000, timing: 0.00000459299917565659
    Length:      10000, timing: 0.00002153400055249222
    Length:     100000, timing: 0.00006719700104440562
    Length:    1000000, timing: 0.00066680999952950515
    Length:   10000000, timing: 0.00673243699930026196
    Length:  100000000, timing: 0.04393487600100343116
    Length: 1000000000, timing: 0.39340837700001429766

The deviations from a strictly linear trend, especially at the small end, are due to measurement noise, branch prediction, and so on.

+2

In most cases @Martijn Pieters' answer is correct, at least in theory. In practice, however, there are many things to consider when it comes to performance.

I recently ran into this problem of hashing long strings as keys: in an exercise I was working on, a timeout error occurred purely because of hashing the keys of a Python dictionary. I know this because I solved the same problem using a JavaScript object as a "dictionary", and it ran just fine with no timeout.

Then, since my keys were actually long strings representing lists of numbers, I made them tuples of numbers instead (a key can be any immutable object). That worked just fine as well.
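Roughly what the change looked like (the data below is a made-up stand-in for my real keys):

    # Made-up stand-in for my real data: the original keys were long strings
    # listing numbers; the fix was to key by a tuple of those numbers instead.
    numbers = [n % 97 for n in range(5000)]

    string_key = ','.join(str(n) for n in numbers)   # long str key
    tuple_key = tuple(numbers)                       # tuple-of-ints key

    by_string = {string_key: 'some value'}
    by_tuple = {tuple_key: 'some value'}             # tuples are hashable too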

I also timed it using the hash() example @Martijn Pieters wrote, comparing the long string-of-numbers keys against the tuple version. The tuple version actually takes longer on repl.it, their online Python interpreter. And I am not talking about a 0.1 difference: it is the difference between 0.02 and 12.02.

Strange, isn't it?! :>

Now the thing is: every environment is different, and the volume of your operations adds up. So you CANNOT simply say whether a particular operation will take longer or shorter. Even if one operation takes only 0.01 s, repeat it just 1000 times and the user is waiting 10 s.

For any production environment you really want to optimize your algorithm where necessary and always go for the better design. For regular software this saves your users valuable time; for cloud services it is actual dollar bills we are talking about.

Finally, given the conflicting results I got in different environments, I definitely do NOT recommend using long strings as keys. You really want to use the ids as keys and, if you ever need to go from text back to id, iterate over the string values. But if you do have to use long strings as keys, consider limiting the number of operations on the dictionary. Storing two versions is certainly a waste of space/RAM; performance versus memory is another lesson in itself.
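For instance, a minimal sketch of what I mean by keeping the ids as keys and only scanning the values when you need the reverse direction (the ids and texts here are made up):

    # Made-up ids and texts, just to illustrate the layout.
    id_to_text = {
        'ulp_887362487687678': 'Some text here',
        'ulp_887362487687679': 'Another whole paragraph of text...',
    }

    # The common, fast path: look a sentence up by its short id.
    text = id_to_text['ulp_887362487687678']

    # The occasional reverse lookup: scan the values instead of
    # maintaining a second dictionary keyed by the long texts.
    def find_id(mapping, wanted_text):
        for ident, value in mapping.items():
            if value == wanted_text:
                return ident
        return None

    print(find_id(id_to_text, 'Some text here'))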

0

Source: https://habr.com/ru/post/1211921/

