Data structure for constructing and searching for a set of integer ranges

Question

Data structure for constructing and searching for a set of integer ranges

I have a set of uint32 integers, there can be millions of elements in the set. 50-70% of them are sequential, but they are displayed in an unpredictable order in the input stream.

I need:

Compress this set into ranges to provide spatial efficient performance. This has already been implemented using a trivial algorithm, since ranges calculated only once are not important here. After this, the number of transformations of the resulting ranges is usually within 5,000-10,000, many of which, of course, are separate elements.
To check that some integer belongs, information about a certain range in the set is not required. It should be very fast - O (1). Thought of minimal perfect hash functions , but they don't work very well with ranges. Bitsets are very inefficient. Other structures, such as binary trees, have O (log n) complexity, the worst with them is that the implementation makes many conditional branches, and the processor cannot predict them, giving poor performance.

Is there any data structure or algorithm specialized in whole ranges to solve this problem?

+5

set algorithm integer data-structures range

actual Jan 05 '11 at 7:31

source share

7 answers

If you know in advance what ranges are, you can check if a given integer is present in one of the ranges O (log n) using the strategy described below. This is not O (1), but in practice it is still pretty fast.

The idea behind this approach is that if you combine all the ranges together, you have a set of disjoint ranges on the number line. From there, you can determine the order at these intervals by indicating that the interval [a, b] & le; [c, d] iff b & le; from. This is a complete order because all ranges do not overlap. This way you can combine all the intervals into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O (n lg n) time.

To check if any interval contains a given integer, you can perform a binary search in this array. Starting at the intermediate interval, check to see if an integer is contained in this interval. If yes, then everything is ready. Otherwise, if the value is less than the smallest value in the range, continue searching on the left, and if the value is greater than the largest value in the range, continue searching on the right. This is essentially a standard binary search, and it should work in O (log n) time.

Hope this helps!

+6

templatetypedef Jan 05 '11 at 7:50

source share

AFAIK there is no algorithm that looks for a complete list in O (1).

One can search for O (1) with a huge amount of memory.

Thus, it is not very promising to try to find the O (1) search algorithm from the integer range list.

Alternatively, you can try the time / memory compilation approach by carefully examining your data set (eventually creating a hash table).

+2

9dan Jan 05 '11 at 8:01

source share

You can use y-fast trees or Van Emde Boas trees to achieve O (lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O (lg_w n) query times. The optimal compromise in terms of n is O (sqrt (log (n))).

The simplest one to implement is probably y-fast trees. They are probably faster than performing a binary search, although they require approximately two hash table queries O (lg w) = O (lg 32) = O (5), whereas for a binary search, approximately O (lg n) = O (lg 10000) = O (13), so binary search can be faster.

+2

jonderry Jan 05 '11 at 9:05

source share

Instead of storing / searching based on comparison (which will always be O (log (n))), you need to work with storage / retrieval based on "radix".

In other words .. extract extracts from uint32 and does trie ..

+1

vrdhn Jan 05 '11 at 8:09

source share

Keep ranges in a sorted array and use binary search to search.

It is easy to implement, O (log N), and uses less memory and requires less memory access than any other tree-based approach, so it is likely to be much faster.

+1

salva Jan 05 '11 at 8:35

source share

From the description of your problem, it sounds like this: the following can be a good compromise. I described this using an object-oriented language, but it can easily be converted to C using a union type or structure with a type and pointer.

Use the first 16 bits to index an array of objects (size 65536). There are 5 possible objects in this array.

A NONE object means that elements starting with these 16 bits are not in the set
ALL means that all elements starting with 16 bits are in the set
a RANGE object means that all elements with the final 16 bits between the upper and lower bounds are in the set
a SINGLE object means that only one element starting with 16 bits is in the array
the BITSET object handles all other cases from bits 65536 bits.

Of course, you do not need to break into 16 bits, you can configure to reflect the statistics of your set. In fact, you do not need to use consecutive bits, but it speeds up the folding of bits, and if many of your elements are consecutive, as you say, this will give good properties.

Hope this makes sense, please comment if I need to explain in more detail. Effectively, you have combined a depth 2 binary tree with ranges and bitrate to trade off between time and speed. If you need to save memory, make the tree deeper with a corresponding slight increase in search time.

+1

Nick fortescue Jan 05 '11 at 15:30

source share

Matthieu M. · Accepted Answer · 2011-01-05T09:50:37+0000

Regarding the second problem:

You can watch Bloom Filters . Bloom Filters are specifically designed to answer the question about membership in O (1), although the answer is either no or maybe (which is not as transparent as yes / no: p).

In the case of maybe , of course, you need further processing to actually answer the question (if in your case there is no probabilistic answer), but even then Bloom Filter can act as a gatekeeper and reject most requests directly.

In addition, you may want to keep the actual ranges and degenerate ranges (individual elements) in different structures.

individual items can be better stored in a hash table
valid ranges can be stored in a sorted array

This reduces the number of elements stored in the sorted array, and therefore the complexity of the binary search performed there. Since you claim that many ranges are degenerate, I assume that you only have 500-1000 ranges (i.e., an order of magnitude smaller) and log (1000) ~ 10

Therefore, I suggest the following steps:

Flower filter: if not, stop
Sorted array of real ranges: if yes, stop
The hash table of individual elements

The Sorted Array test is performed first, because of the number you give (millions of numbers combined in several thousand ranges), if the number is contained, most likely it will be in the range, and not alone :)

One final note: beware of O (1), although this may seem attractive, you are not in the asymptotic case here. Quite a few 5000-10000 ranges, since log (10000) is something like 13. Therefore, do not pessimize your implementation by obtaining a solution O (1) with such a high constant coefficient that it actually works slower than O ( log N) solution :)

Data structure for constructing and searching for a set of integer ranges

More articles: