You can find the answer in the redis dict.c source file. Then I will give a part of it.
The iteration works as follows:
- First you call the function using the cursor value (v) 0. 2)
- The function performs one iteration step and returns a new cursor value that you should use in the next call.
- When the return cursor is 0, the iteration is complete.
The function ensures that all elements present in the dictionary are returned between the beginning and end of the iteration. However, some items may be returned several times. For each element returned, the fn callback argument is called with privdata as the first argument, and the entry'de dictionary element is called the second argument.
How it works
The iterative algorithm was developed by Pieter Noordhuis . The basic idea is to increase the cursor starting with higher order bits. That is, instead of the usual cursor, the cursor changes to the opposite, then the cursor increases, and finally, the bits change again.
This strategy is necessary because the hash table can be changed in the range between iterative calls. dict.c hash tables always have a power of two in size, and they use a chain, so the position of the element in this table is determined by calculating the bitwise AND between Hash (key) and SIZE-1 (where SIZE-1 is always a mask that is equivalent to the rest parts of the separation between the key hash and SIZE).
For example, if the current size of the hash table is 16, the mask (in binary form) is 1111. The key position in the hash table will always be the last four bits of the hash output, etc.
What happens if the table changes in size?
If the hash table grows, elements can be moved anywhere on the same edge of the old bucket: for example, suppose we have already iterated using the 4-bit cursor 1100 (mask 1111 because the size of the hash table = 16).
If the hash table is changed to 64 elements, the new mask will be 111111. The new buckets that you get, replacing 1100 with 0 or 1, can only target keys that we have already visited when scanning the 1100 bucket in a smaller hash table.
Iterating the most significant bits first, the cursor does not need to be restarted due to the inverted counter if the table becomes larger. It will continue to iterate using cursors without the “1100” at the end, as well as without any other combination of the last 4 bits that have already been explored.
Similarly, when the size of the table decreases over time, for example, from 16 to 8, if the combination of the three lower bits (mask for size 8 is 111) has already been fully explored, it will not be visited again, because we are sure that we tried, for example, both 0111 and 1111 (all variants of a higher bit), so we do not need to test it again.
Wait ... You have two tables during paraphrasing!
Yes, it’s true, but first we first iterate over the smaller table, then check all the extensions of the current cursor to the large table. For example, if the current cursor is 101, and we also have a large table of size 16, we also test (0) 101 and (1) 101 inside the larger table. This reduces the problem to only one table, where the larger, if it exists, is simply an extension of the smaller.
Limitations
This iterator is completely stateless, and this is a huge advantage, including the lack of additional memory. Disadvantages resulting from this design:
- Maybe we will return the items more than once. However, this is usually easy to handle at the application level.
- The iterator should return several elements for each call, since it should always return all keys bound in this bucket, and all extensions, so we are sure that we will not miss the keys that move during the reboot.
- The back cursor is hard to understand at first, but this comment should help.