Combining a trie with a hash map gives O(log n) search/insert/delete.
Each trie node contains an id, a counter of valid elements in its subtree, and up to two child pointers. The bit string defined by the left (0) or right (1) turns taken on the path from the trie root to a given node is stored in the hash map under the corresponding id.
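As a concrete illustration, here is a minimal sketch of that layout in Python (class and field names are my own, not prescribed by the structure):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TrieNode:
    id: Optional[int] = None            # id stored here; None means the node is deleted/invalid
    count: int = 0                      # number of valid elements in the subtree rooted here
    left: Optional["TrieNode"] = None   # 0-branch
    right: Optional["TrieNode"] = None  # 1-branch

# id -> bit string of turns ("0" = left, "1" = right) from the root to the node holding that id
paths: Dict[int, str] = {}
root = TrieNode()
```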
The delete operation marks the trie node as invalid and updates the valid-element counters on the path from the deleted node up to the root. It also removes the corresponding entry from the hash map.
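A hedged sketch of the delete operation, assuming the node/paths layout above (repeated here so the snippet runs on its own):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TrieNode:                          # same layout as in the earlier sketch
    id: Optional[int] = None
    count: int = 0
    left: Optional["TrieNode"] = None
    right: Optional["TrieNode"] = None

def delete(root: TrieNode, paths: Dict[int, str], id_: int) -> None:
    """Mark the node holding id_ as invalid and fix counters on its root path."""
    node, trail = root, [root]
    for bit in paths.pop(id_):           # also removes the hash-map entry
        node = node.left if bit == "0" else node.right
        trail.append(node)
    node.id = None                       # the node stays in the trie, only marked invalid
    for n in trail:
        n.count -= 1                     # one valid element fewer in every subtree on the path
```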
The insert operation uses the position parameter together with the valid-element counters in each trie node to find the new node's predecessor and successor. If the in-order traversal from predecessor to successor contains any deleted nodes, select the one with the lowest rank and reuse it. Otherwise, select either the predecessor or the successor and attach a new child node to it (a right child for the predecessor or a left child for the successor). Then update the valid-element counters on the path from this node to the root and add the corresponding entry to the hash map.
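A simplified sketch of the insert descent, again assuming the layout above. For brevity it only reuses a deleted node when that node's whole subtree is empty, rather than picking the lowest-rank deleted node between predecessor and successor as described:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TrieNode:                          # same layout as in the earlier sketch
    id: Optional[int] = None
    count: int = 0
    left: Optional["TrieNode"] = None
    right: Optional["TrieNode"] = None

def insert(root: TrieNode, paths: Dict[int, str], new_id: int, pos: int) -> None:
    """Insert new_id so that it becomes the pos-th valid element (0-based)."""
    node, bits, trail = root, [], []
    while True:
        trail.append(node)
        left_count = node.left.count if node.left else 0
        here = 1 if node.id is not None else 0
        if pos <= left_count:            # new element belongs before this node
            if node.left is None:
                node.left = TrieNode()
            bits.append("0")
            node = node.left
        else:                            # new element belongs after this node
            pos -= left_count + here
            if node.right is None:
                node.right = TrieNode()
            bits.append("1")
            node = node.right
        if node.id is None and node.count == 0:   # empty slot: place the element here
            trail.append(node)
            break
    node.id = new_id
    for n in trail:
        n.count += 1                     # one more valid element in every subtree on the path
    paths[new_id] = "".join(bits)
```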
The search operation fetches the bit string from the hash map and follows it from the trie root to the corresponding node, summing the counts of valid elements that lie to the left of this path; the sum is the element's current position.
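A sketch of that search, mapping an id to its current position (same assumed layout):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TrieNode:                          # same layout as in the earlier sketch
    id: Optional[int] = None
    count: int = 0
    left: Optional["TrieNode"] = None
    right: Optional["TrieNode"] = None

def position_of(root: TrieNode, paths: Dict[int, str], id_: int) -> int:
    """Follow the stored bit string and sum the valid elements left of the path."""
    pos, node = 0, root
    for bit in paths[id_]:
        if bit == "1":
            # the left subtree and this node's own element (if valid) stay to our left
            pos += node.left.count if node.left else 0
            pos += 1 if node.id is not None else 0
            node = node.right
        else:
            node = node.left
    # elements inside the target node's own left subtree are also to its left
    pos += node.left.count if node.left else 0
    return pos
```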
All this gives an expected time of O(log n) per operation if the sequence of insertions/deletions is sufficiently random. Otherwise the worst-case cost of an operation is O(n). To bring it back to O(log n) (amortized), track the sparsity and balance factors of the tree: if there are too many deleted nodes, rebuild a new, perfectly balanced and dense tree; if the tree is too unbalanced, rebuild its most unbalanced subtree.
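A hedged sketch of the "rebuild a perfectly balanced, dense tree" step, using the same assumed layout; note the rebuild also has to refresh the id-to-path map:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TrieNode:                          # same layout as in the earlier sketch
    id: Optional[int] = None
    count: int = 0
    left: Optional["TrieNode"] = None
    right: Optional["TrieNode"] = None

def valid_ids_in_order(node: Optional[TrieNode]) -> List[int]:
    """In-order list of the ids of all valid (non-deleted) nodes."""
    if node is None:
        return []
    ids = valid_ids_in_order(node.left)
    if node.id is not None:
        ids.append(node.id)
    return ids + valid_ids_in_order(node.right)

def build_balanced(ids: List[int], paths: Dict[int, str], prefix: str = "") -> Optional[TrieNode]:
    """Build a perfectly balanced, dense trie over ids and record their new paths."""
    if not ids:
        return None
    mid = len(ids) // 2
    node = TrieNode(id=ids[mid], count=len(ids))
    paths[ids[mid]] = prefix
    node.left = build_balanced(ids[:mid], paths, prefix + "0")
    node.right = build_balanced(ids[mid + 1:], paths, prefix + "1")
    return node

# Usage, after deciding the tree is too sparse or too unbalanced:
# paths.clear(); root = build_balanced(valid_ids_in_order(root), paths)
```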
Instead of a hash map, you can use a binary search tree or any other dictionary data structure. Instead of the bit string identifying the path in the trie, the hash map can store a pointer to the corresponding trie node.
Another alternative to the trie in this data structure is an Indexable Skiplist.
O(log N) time per operation is acceptable but not perfect. As Kevin explained, it is possible to use an algorithm with O(1) search complexity in exchange for a higher complexity of the other operations: O(sqrt(N)). And that can be improved further.
If you fix the number of memory accesses (M) per search operation, the other operations can be performed in O(M * N^(1/M)) time. The idea of such an algorithm is presented in this answer to the related question. The trie structure described there makes it easy to convert a position to an array index and vice versa. Each non-empty element of this array contains an id, and each hash map entry maps that id back to its array index.
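A minimal sketch of the bookkeeping just described (names are illustrative; the trie that converts positions to array indexes is omitted):

```python
from typing import Dict, List, Optional

# Sparse array: occupied slots hold ids, None slots are the reserved empty space.
slots: List[Optional[int]] = [None] * 16
index_of: Dict[int, int] = {}            # id -> current array index

def place(id_: int, array_index: int) -> None:
    """Put an id into a free slot and remember where it went."""
    assert slots[array_index] is None, "slot must be empty"
    slots[array_index] = id_
    index_of[id_] = array_index
```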
To allow insertions into this data structure, each block of adjacent elements in the array must be interleaved with some empty space. When one of the blocks runs out of its empty space, we rebuild the smallest group of blocks corresponding to some trie node that still has more than 50% free space. When the total amount of empty space falls below 50% or rises above 75%, we rebuild the whole structure.
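The global rebuild condition translates directly into a check like this (thresholds taken from the text; the helper name is my own):

```python
def needs_global_rebuild(num_empty: int, capacity: int) -> bool:
    """Rebuild the whole structure when total empty space leaves the 50%..75% band."""
    free_fraction = num_empty / capacity
    return free_fraction < 0.50 or free_fraction > 0.75
```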
This rebuilding scheme gives O(M * N^(1/M)) amortized complexity only for random, evenly distributed insertions/deletions. The worst-case complexity (for example, if we always insert at the far left) is much worse for M > 2. To guarantee a worst case of O(M * N^(1/M)), we need to reserve more memory and modify the scheme so that it maintains this invariant: keep at least 50% empty space reserved for the whole structure, at least 75% for the data under each top-level trie node, 87.5% for the next level, and so on.
With M = 2 we have O(1) search time and O(sqrt(N)) time for the other operations.
With M = log(N) the time is O(log(N)) for each operation.
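Substituting these two choices of M into the O(M * N^(1/M)) bound makes both claims explicit:

```latex
M \cdot N^{1/M} \Big|_{M=2} = 2\sqrt{N}, \qquad
M \cdot N^{1/M} \Big|_{M=\log_2 N} = \log_2 N \cdot N^{1/\log_2 N} = 2\log_2 N = O(\log N).
```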
In practice, however, small values of M are preferable (for example, 2..5). The search time can then be treated as O(1), and a typical insert/delete operation touches at most 5 relatively small contiguous blocks of memory, which is cache-friendly and vectorizes well. It also limits the memory requirements if we need the guaranteed worst-case complexity.