Data structure to support a particular query on a set of two-dimensional points

I have a set of two-dimensional points, and I want to be able to make the following query with arguments x_min and n: what are the n points with the largest y that have x > x_min?

To rephrase in Ruby:

class PointsThing
  def initialize(points)
    @points = points
  end

  def query(x_min, n)
    # Sort by descending y so that take(n) returns the n largest-y points.
    @points.select { |point| point.x > x_min }
           .sort_by { |point| -point.y }
           .take(n)
  end
end
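
For concreteness, a hypothetical usage, assuming a simple Point struct (the struct is not part of the original question):

Point = Struct.new(:x, :y)
points = [Point.new(1, 5), Point.new(3, 2), Point.new(4, 9), Point.new(0, 7)]
PointsThing.new(points).query(2, 2)  # => the points (4, 9) and (3, 2)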

Ideally, my class will also support insert and delete operations.

I cannot imagine a data structure for this that would allow a query to run in less than O(|@points|) time. Does anyone know of one?

4 answers

Sort the points in descending order by x-coordinate. For each point, in that order, insert it into a purely functional (persistent) red-black tree ordered by descending y-coordinate. Store all of the intermediate trees in an array.

To answer a query for a given x_min, binary-search the array for the intermediate tree into which exactly the points with x > x_min have been inserted. Traverse that tree to find the first n points.

The preprocessing cost is O(p log p) in time and space, where p is the number of points. The query time is O(log p + n), where n is the number of points returned by the query.
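
A minimal Ruby sketch of this scheme, with a path-copying (persistent) unbalanced BST standing in for the red-black tree: the structure sharing between intermediate trees is illustrated, but the O(log p) balance guarantee of the answer is not. All names here are illustrative.

Node = Struct.new(:point, :left, :right)

class TopYIndex
  def initialize(points)
    sorted = points.sort_by { |p| -p.x }   # descending x
    @xs = sorted.map(&:x)                  # for binary search over x_min
    @trees = []                            # @trees[i] holds the first i+1 points
    tree = nil
    sorted.each do |p|
      tree = insert(tree, p)               # persistent insert shares structure
      @trees << tree
    end
  end

  def query(x_min, n)
    # Count points with x > x_min; @xs is descending, so the predicate
    # flips from false to true exactly once.
    count = @xs.bsearch_index { |x| x <= x_min } || @xs.size
    return [] if count.zero?
    acc = []
    walk(@trees[count - 1], n, acc)
    acc
  end

  private

  # Path-copying insert keyed by descending y: larger y goes left.
  def insert(node, p)
    return Node.new(p, nil, nil) if node.nil?
    if p.y > node.point.y
      Node.new(node.point, insert(node.left, p), node.right)
    else
      Node.new(node.point, node.left, insert(node.right, p))
    end
  end

  # In-order traversal (left first) yields points by descending y; stop at n.
  def walk(node, n, acc)
    return if node.nil? || acc.size >= n
    walk(node.left, n, acc)
    acc << node.point if acc.size < n
    walk(node.right, n, acc)
  end
end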


If your data is not sorted, then you have no choice but to check every point, since you cannot know whether some unchecked point has a y greater than all the others and also has x > x_min. In short: you cannot know whether an item should be included unless you check all of them.

In that case, I would suggest it is impossible for your query to run in sub-linear time, since you have to check all of the points. The best case for finding them all would be linear.

If your data is sorted, then your best case will be constant (the n points with the highest y all qualify), and the worst case will be linear (the n qualifying points all have the smallest y). The average case will be closer to constant, I think, if your x and x_min are roughly random within some range.

If you want this to scale (i.e. you could have large values of n), you will also want to maintain your result set as you go, since you will need to check new candidate points against it and evict the lowest value when you insert (once its size exceeds n). Using a tree, that can be log time.

So, putting it all together: the worst case is unsorted points, where you are looking at O(n log n) time. Sorted points are better; there you are looking at an average case of O(log n) time (again, assuming roughly randomly distributed values for x and x_min), which is sublinear.


If it is not immediately clear why sorted points give a constant-time search, I will run through it quickly here.

If the n points with the highest y values all have x > x_min (the best case), then you are just grabbing exactly what you need off the top, so that case is obvious.

For the average case, assuming a roughly random distribution of x and x_min, the probability that x > x_min is basically one half. For any two random numbers a and b, a > b is exactly as likely as b > a. The same holds for x and x_min: x > x_min is as likely as x_min > x, which means a probability of 0.5. So, on average, every second point you check will meet your x > x_min requirement, and you will check about 2n points to find the n highest points that match your criteria. If the best case takes time c, the average is 2c, which is still constant.

Note, however, that for values of n approaching the size of the set, this hides the fact that you are walking the entire set, which takes you back to linear time. So my claim of constant time does not hold if you allow arbitrary values of n up to the size of your set.

If this is not a purely academic question but is driven by some real need, then the best choice depends on the situation.

(edit) I just realized that my claims about constant time presuppose a data structure with direct access to the highest value and sequential access to successively lower values. If the data structure you are given does not match that description, then obviously this does not hold.
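
A minimal sketch of the scan described above, assuming the points are already held in an array sorted by descending y (the function name is illustrative):

# Walk from the highest y downward, keeping the first n matches.
def top_n_by_y(points_desc_by_y, x_min, n)
  result = []
  points_desc_by_y.each do |p|
    next unless p.x > x_min
    result << p
    break if result.size == n
  end
  result
end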


In this case, some pre-processing may help.

First, partition the set of points, taking x_min as the pivot element.

Then, for the set of points lying to the right of x_min, build a max-heap keyed on the y-coordinates.

Now run your query: perform n extract_max operations on the max-heap you built.

The running time of your query will be log X + log(X-1) + ... + log(X-(n-1)), where:

log X: for the first extract_max operation.

log(X-1): for the second extract_max operation, and so on.

X: the size of the max-heap built above.

Even in the worst case, with n < X, the total time taken is O(n log X).
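
A minimal Ruby sketch of this approach. Ruby's standard library has no heap, so this assumes a small hand-rolled binary max-heap; class and method names are illustrative.

class MaxHeap
  def initialize(points)
    @a = points.dup
    # Bottom-up heapify: O(X).
    (@a.size / 2 - 1).downto(0) { |i| sift_down(i) }
  end

  # Remove and return the point with the largest y: O(log X).
  def extract_max
    return nil if @a.empty?
    top = @a[0]
    last = @a.pop
    unless @a.empty?
      @a[0] = last
      sift_down(0)
    end
    top
  end

  private

  def sift_down(i)
    loop do
      l = 2 * i + 1
      r = 2 * i + 2
      m = i
      m = l if l < @a.size && @a[l].y > @a[m].y
      m = r if r < @a.size && @a[r].y > @a[m].y
      break if m == i
      @a[i], @a[m] = @a[m], @a[i]
      i = m
    end
  end
end

# Partition on x_min, heapify the right-hand side, then pop n times.
def top_n(points, x_min, n)
  heap = MaxHeap.new(points.select { |p| p.x > x_min })
  n.times.map { heap.extract_max }.compact
end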


Notation

Let P be the set of points.

Let top_y(n, x_min) denote the query that collects the n points of P with the largest y-coordinates among those whose x-coordinate is greater than or equal to x_min.

Let x_0 be the minimum x-coordinate in your point set. Partition the x-axis to the right of x_0 into a set of left-closed, right-open intervals I_i, determined by the x-coordinates of the point set P, such that min(I_i) is the i-th smallest x-coordinate of P. Define the coordinate rank r(x) of x as the index of the interval that x is an element of, or 0 if x < x_0.

Note that r(x) can be computed in O(log #({I_i})) using a binary search tree.
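
As an illustration, r(x) can be sketched with a binary search over a sorted array of the distinct x-coordinates, standing in for the binary search tree (names are illustrative):

# xs: ascending distinct x-coordinates of P, so xs[0] == x_0.
def rank(xs, x)
  return 0 if x < xs[0]
  # Count of coordinates <= x, which is the 1-based index of the
  # left-closed interval containing x.
  xs.bsearch_index { |v| v > x } || xs.size
end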

A simple solution

  • Sort your points by decreasing y-coordinate and store them in an array A, taking time O(#P log #P) and space O(#P).

  • Process each query top_y(n, x_min) by walking through this array in order, skipping items A_i with A_i.x < x_min and counting all other entries until the counter reaches n or you hit the end of A. This processing takes O(#P) time and O(1) extra space.

Note that this may already be enough: a query top_y(n_0, a_0) with a_0 < min { p.x | p \in P } and n_0 = c * #P, c = const, requires time proportional to #P in any case; and for n << #P and "infrequent" queries, further optimization may not be worth the effort.

Observation

  • Consider the sequences s_i, s_(i+1) of points with x-coordinates greater than or equal to min(I_i), min(I_(i+1)) respectively, each ordered by decreasing y-coordinate. s_(i+1) is a strict subsequence of s_i.

  • If p_1 \in s_(i+1) and p_2.x >= p_1.x, then p_2 \in s_(i+1).

Refined Solution

The refined data structure allows O(n) + O(log #P) query processing times.

Annotate the array A from the simple solution: at each element A_i with A_(i+1).x < A_i.x, attach dispatch data consisting of an array disp:[r(A_(i+1).x) + 1 .. r(A_i.x)] of indexes into A, each giving the next element of A whose x-coordinate is at least as high as the rank used to index into disp. These dispatch indexes are enough to process a query, because ...

  • ... disp[j] = disp[r(A_(i+1).x) + 1] for each j <= r(A_(i+1).x).
  • ... for any x_min with r(x_min) > r(A_i.x), the algorithm will never visit this element.

The natural index for accessing disp is r(x_min), which remains constant throughout a query and therefore takes O(log #P) to compute once per query, while the index lookup at each element of A is O(1).

disp can be precomputed. No two disp entries across all the disp arrays are identical (the proof is omitted, but it is easy [;-)] to see from the construction). Therefore the disp arrays can be built with a stack in a single scan through the point set sorted as in A. Since there are #P entries in total, the disp structure takes O(#P) space and O(#P) construction time, both dominated by the space and time requirements of the y-sorting. So, in a sense, this structure comes for free.

Time requirements for a query top_y(n, x_min)

  • Computing r(x_min): O(log #P);
  • Walking through A: O(n).
