Data structure for quickly checking numbers with overlapping ranges

Say I have the following ranges of numbers:

0-500 0-100 75-127 125-157 130-198 198-200

Now, let's say, I need to be able to check any given number and see what ranges it is in. What data structure will be used most effectively to say that, for example, the number 100 belongs to the ranges 0-500, 0-100 and 75-127? I just want the binary tree to keep the initial values? In this case, would it be advisable for each node in the tree to hold several objects containing each range at this starting point?

Please note that I only need to get for this specific application, I really do not see that I need to change it in the middle of the process, so the search speed is by far my priority.

Thanks!

+6
source share
4 answers

You need an interval tree . Interval trees are a fairly general concept and contain several different data in nodes for each problem. In your case, each node will contain a list of input intervals that span the node interval.

+4
source

Let R denote the number of possible ranges. (For your example, R=6 )

Create R hash tables so that each hash table can contain only one range. For your example, you need to create 6 hash tables. The first hash table R1 will contain only the values ​​0-500.

Pour the hash table.

Each number will go to the corresponding hash table. For your example, number 100 will go to R1 , R2 , R3 . If R large, then you will need to create many hash tables. But then the total space is limited by the actual data stored in all the hash tables assembled together.

reading:

For any given number, check to see if it is present in each of the R hash tables. You can further optimize by choosing which hash tables should look like. For example, for 100 you only need to examine 3 of the 6 hash tables.

Time complexity:

Searching for a single value in a hash table takes constant time (average). So O(1) Amortized to find out if a number is present in the hash table.

O(R) amortized to get the result, since we need to look at all the hash tables to output the output

+2
source

Edited by:

  • Much faster
  • Much more efficient memory consumption.
  • For 1000 inputs, this ensures that it works in less than 100 ms in the worst cases.
  • For 10,000 input ranges, 10x is larger than the required input, it still works in less than 100ms for 50 duplicates from ex: {0 - #} x50, {1 - #} x50, etc. for all from ranges.

goal

Given the 10,000 input interval intervals (From - To - To make the worst case scenario, each "of" is duplicated 50 times). The program takes one target value and displays all the ranges to which the target belongs.

An example of how it works for 4 ranges:

Given the following ranges:

 0 - 100 0 - 500 50 - 500 20 - 300 

Target: 40

Output

 20 - 300 0 - 500 0 - 100 

Algorithm explanation (using the previous example):

Each from value is mapped to index++ , starting at index = 0 . So in our example:

 From => index 0 => 0 50 => 1 20 => 2 

Now, having an array of tree sets called pointers , each index i this array references the key values ​​of i in the tree set. So, pointers[0] refers to 'from': 0, pointers[1] refers to 'from': 50, pointers[2] refers to 'from': 20.

Now we add the to values to each from value, looking at the map of the tree 'key' => 'value' , where value is the key index in the pointers array.

In addition, we want the to values to added to the pointers array at each index, which should be sorted in descending order (I will explain why later).

So now the pointer will look like this:

 index => TreeSet[values..] 0 => 500 | 100 1 => 500 2 => 300 

Now we are ready to get the ranges to which the target belongs.

For target = 40

1 - Find the closest 40 floor key on the tree map. The program believes that 20 is the closest.

2 - It falls into the index corresponding to 20 in the array of pointers. To get index 20: look at the tree map with key 20. Index 20 is 2.

3 - So, go to pointers[2] , he discovers that there is a number 300.

4 - Now the application checks that 300 is less than the target, the reason why it does this is because I already mentioned that each created set of trees is sorted in descending order. Therefore, if the value 300 is less than the target, therefore, it is not necessary to continue checking the following values ​​in pointers[2] , because they guarantee that they are less.

5 - In this case, 300 is larger than the target, then type the key as from and pointer[2] {current element} as to .

6 - Since only one element is found in pointers[2] , the for loop ends and the key is deleted from the tree map, so the next time the program wants to find the next closest floor key, it will find the next one if it exists.

7 - (Next iteration of the while loop) The next key found after deleting key 20 is 0, with index 0 according to the tree map.

8 - Go to pointers[0] . He finds that the number of elements in the tree set in pointers[0] is 2.

9 - Start with the first element pointers[0] {first element} . Is 500 less than 40? No, type "range." Next item, 100 is less than 40? No, type "range." There are no more items to loop.

10 - Remove key 0 from the tree map.

11 - Now the condition of the While loop checks if the closest key exists for the target. No, because 0 and 20 have been removed. Therefore, the condition is not met, exit the while loop.

12 - Printing time and final program :)

We hope you find a helpful explanation.

code

 import java.util.Collections; import java.util.TreeMap; import java.util.TreeSet; public class Interval { public static void main(String[] args) { int MAX_SIZE = 10000; int[] from = new int[MAX_SIZE]; int[] to = new int[MAX_SIZE]; //Generate 10,000 (from - to), with 50 redundant (from range) for every value of from. (to make the application heavy) int c = 0; for(int i=0; i<MAX_SIZE;i++){ from[i] = 0+c; to[i] = from[i] + (int)(Math.random()*100); if(i%50 == 0) c++; } //Start time counting long time = System.currentTimeMillis(); int target = 97; TreeMap<Integer, Integer> treePointer = new TreeMap<Integer, Integer>(); // will sotre <From, index++> int index = 0; int size = from.length; TreeSet<Integer>[] pointers = new TreeSet[size]; //Array of tree set to store values of every "from" range //insert into tree for(int i=0; i<from.length;i++){ if(!treePointer.containsKey(from[i])){ //if the "from" does not exist in the tree yet, insert it treePointer.put(from[i], index); pointers[index++] = new TreeSet<Integer>(Collections.reverseOrder()); //sort descending order } //index of "from" in the pointers array int virtualIndex = treePointer.get(from[i]); //add the 'to' element to the corresponding index of "from" in Tree Set at pointers[index of "from"] pointers[virtualIndex].add(to[i]); } // Display part of the pointers array to understand how elements are stored // for(int i=0; i<10; i++){ // for(int current : pointers[i]){ // System.out.print(current + " "); // } // System.out.println(); // } //Start checking for the ranges Integer currentKey = -1; //dummy number at first while((currentKey = treePointer.floorKey(target)) != null){ // while there is a closest floor key //get index assigned to the closest floor key int virtualIndex = treePointer.get(currentKey); //loop on the elements found at pointers[index of closest floor number] for(int toValue : pointers[virtualIndex]){ if(toValue < target) //remember the values are sorted in a descending order, so whenever the value becomes smaller than the target don't continue the for loop break; System.out.println(currentKey + " - " + toValue); // else target less or equal to toValue, so print range } treePointer.remove(currentKey); //remove key from tree to fetch the next floor key in the next while iteration } //Display time consumed in ms System.out.println("\nTotal time: " + (System.currentTimeMillis() - time) + " ms"); } } 
+1
source

Assuming you have enough memory, I would use an array of lists where the list has all the ranges containing the index.

Here is a Python solution showing the algorithm in more detail:

 # (Inclusive) ranges ranges = [(0,500), (0,100), (75,127), (125,157), (130,198), (198,200)] smallest = min(r[0] for r in ranges) largest = max(r[1] for r in ranges) # Ceate table table = [[] for i in range(smallest, largest+1)] # List of lists for r in ranges: # pre-compute results mn, mx = r for index in range(mn, mx+1): table[index - smallest].append(r) def check(n): 'Return list of ranges containing n' if smallest <= n <= largest: return table[n - smallest] else: return [] # Out of range for n in [-10, 10, 75, 127, 129, 130, 158, 197, 198, 199, 500, 501]: print('%3i is in groups: %r' % (n, check(n))) 

Output:

 -10 is in groups: [] 10 is in groups: [(0, 500), (0, 100)] 75 is in groups: [(0, 500), (0, 100), (75, 127)] 127 is in groups: [(0, 500), (75, 127), (125, 157)] 129 is in groups: [(0, 500), (125, 157)] 130 is in groups: [(0, 500), (125, 157), (130, 198)] 158 is in groups: [(0, 500), (130, 198)] 197 is in groups: [(0, 500), (130, 198)] 198 is in groups: [(0, 500), (130, 198), (198, 200)] 199 is in groups: [(0, 500), (198, 200)] 500 is in groups: [(0, 500)] 501 is in groups: [] 

Further optimizations are possible, such as using a bit-set to store which ranges are available for each index, not a list.

In the comments below, I decided to modify the above to select the table search method above and a slower, but significantly more efficient memory solution based on direct comparison. (Ideally, an interval tree based solution would also be included):

 # (Inclusive) ranges ranges = [(0,500), (0,100), (75,127), (125,157), (130,198), (198,200)] limit = 1000000 # Or whatever smallest = min(r[0] for r in ranges) largest = max(r[1] for r in ranges) if (largest - smallest) * len(ranges) < limit: # Ceate table table = [[] for i in range(smallest, largest+1)] # List of lists for r in ranges: mn, mx = r for index in range(mn, mx+1): table[index - smallest].append(r) def check(n): 'Return list of ranges containing n' if smallest <= n <= largest: return table[n - smallest] else: return [] # Out of range else: # mpre emory efficient method, for example def check(n): return [(mn, mx) for mn, mx in ranges if mn <= n <= mx] for n in [-10, 10, 75, 127, 129, 130, 158, 197, 198, 199, 500, 501]: print('%3i is in groups: %r' % (n, check(n))) 
+1
source

All Articles