Can I make an O(1) search algorithm using a sorted array with a known step?

Background

My software visualizes very large datasets. The data is so big that I cannot keep it all in RAM while a page is loaded. I am using matplotlib in the backend of my application to display and process the graphs.

These datasets contain three internal lists that I use for visualization: time, height and dataset. My program plots time against height, and users can also draw shapes around areas of the graph whose contents can be extracted into a completely different graph.

The hard part is that when I want to extract data from the shapes, the vertices of the shape are the exact coordinates computed on the graph, not values rounded to the nearest point in my time array. Here is an example of a shape bounding an area in my program:

[image: a drawn shape bounding an area of the plotted graph]

While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the nearest coordinate actually present in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only a tiny distance from the matplotlib coordinate.


Problem

So, given an arbitrary value, say x1 = 732839.154395 (representing a date in numeric format), and a list of similar values with a constant step:

    732839.154392
    732839.154392
    732839.154393
    732839.154393
    732839.154394
    732839.154394
    732839.154395
    732839.154396
    732839.154396
    732839.154397
    732839.154397
    732839.154398
    732839.154398
    732839.154399
    etc...

What would be the most efficient way to find the closest representation of this point? I could just iterate over the list and take the value with the smallest difference, but time is huge. Since I know that the array is (1) sorted and (2) incremented in constant steps, I thought this problem should be solvable in O(1) time. Is there a known algorithm for this, or do I need to develop something custom? Here is my current thinking:

1. Grab the first and second elements of time.
2. Subtract the first element from the second to obtain the step.
3. Subtract the first element of time from the bounding x value to obtain the difference.
4. Divide the difference by the step to obtain the index.
5. Jump forward in time to that index.
6. Check the elements surrounding the index to ensure the closest representation.
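The steps above can be sketched in a few lines of Python. This is a minimal sketch, not your actual code: the function name `closest_index` and the synthetic `times` list are my own, and it assumes the step really is (roughly) constant.

```python
def closest_index(times, x):
    """Estimate the index of the value in sorted, constant-step `times`
    that is closest to `x`, in O(1) time plus a neighbour check."""
    step = times[1] - times[0]            # step from the first two elements
    guess = round((x - times[0]) / step)  # direct O(1) index estimate
    guess = max(0, min(guess, len(times) - 1))
    # check the neighbours in case the step is not perfectly uniform
    lo, hi = max(0, guess - 1), min(len(times) - 1, guess + 1)
    return min(range(lo, hi + 1), key=lambda i: abs(times[i] - x))

# synthetic data with a constant 1e-6 step, mimicking the question
times = [732839.154392 + i * 1e-6 for i in range(8)]
print(closest_index(times, 732839.154395))  # → 3
```

The final neighbour check is what guards against the truncation problem discussed in the answers: the direct estimate can land one element off when the recorded step is coarser than the true step.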
3 answers

The algorithm you propose seems reasonable and should work.

As your comments made clear, the problem is the coarseness with which your time values were recorded. (This is common when data is recorded asynchronously, i.e. when the clock driving the data, such as a frame rate, is not synchronized with the computer's clock.)

An easy fix: read two points separated by a long interval, for example the first time value and the 1000th. Everything else in your calculation stays the same, but you obtain the time step by subtracting the two and then dividing by 1000.

Here's a test that makes the data look like yours:

    start = 97523.29783
    increment = .000378912098
    target = 97585.23452

    # build a timeline
    times = []
    time = start
    actual_index = None
    for i in range(1000000):
        trunc = float(str(time)[:10])  # truncate the time value
        times.append(trunc)
        if actual_index is None and time > target:
            actual_index = i
        time = time + increment

    # now test
    intervals = [1, 2, 5, 10, 100, 1000, 10000]
    for i in intervals:
        dt = (times[i] - times[0]) / i
        index = int((target - start) / dt)
        print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))

Result:

      span   actual    guess   est dt (actual=.000378912098)
         1   163460   154841   0.0004000000
         2   163460   176961   0.0003500000
         5   163460   162991   0.0003800000
        10   163460   162991   0.0003800000
       100   163460   163421   0.0003790000
      1000   163460   163464   0.0003789000
     10000   163460   163460   0.0003789100

That is, as the span between the selected points increases, the estimate of the time step becomes more accurate (compare with increment in the program), and the estimated index (third column) approaches the actual index (second column). Note that the accuracy of the dt estimate is roughly proportional to the number of digits spanned. The best you could do is use the times at the very first and last points; you seemed to suggest that reading the last value would be difficult, but if it is not, it will give the most accurate estimate of your time step. Note that here, for clarity, I exaggerated the lack of precision by truncating my time values heavily, but in general every power of 10 in your span improves your accuracy by the same factor.

As an example of this last point, if I reduce the truncation of the time values by changing the line to trunc = float(str(time)[:12]), I get:

      span   actual    guess   est dt (actual=.000378912098)
         1   163460   163853   0.0003780000
        10   163460   163464   0.0003789000
       100   163460   163460   0.0003789100
      1000   163460   163459   0.0003789120
     10000   163460   163459   0.0003789121

So if, as you say, using a span of 1 already gets you close, using a span of 100 or 1000 should be more than enough.

Overall, this is very similar in spirit to interpolation search. It is a little easier to implement because it makes only one interpolation-based guess, so it takes just one line of code: int((target-start)*i/(times[i] - times[0])).


What you are describing is pretty much an interpolation search. It is very similar to binary search, but instead of probing the middle element, it assumes the distribution is close to uniform and probes an estimated position.

The Wikipedia article on interpolation search includes a C++ implementation.
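For illustration, here is a minimal Python sketch of interpolation search, adapted to return the index of the *closest* element rather than only exact matches (the function name and test data are my own, not from the question):

```python
def interpolation_search(arr, x):
    """Return the index of the element of sorted `arr` closest to `x`,
    probing at the position predicted by linear interpolation."""
    if x <= arr[0]:
        return 0
    if x >= arr[-1]:
        return len(arr) - 1
    lo, hi = 0, len(arr) - 1
    while lo < hi and arr[lo] < x < arr[hi]:
        # predicted position, assuming values are roughly uniform
        mid = lo + int((x - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]))
        mid = min(max(mid, lo), hi)
        if arr[mid] < x:
            lo = mid + 1
        elif arr[mid] > x:
            hi = mid - 1
        else:
            return mid
    # pick the nearest of the final neighbours
    candidates = [i for i in (lo - 1, lo, lo + 1) if 0 <= i < len(arr)]
    return min(candidates, key=lambda i: abs(arr[i] - x))

arr = [i * 0.5 for i in range(100)]
print(interpolation_search(arr, 12.3))  # → 25 (arr[25] == 12.5 is closest)
```

On nearly uniform data like yours, the first probe usually lands within one or two elements of the target, which is why the single-guess version in the other answer works so well.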


What you are doing is effectively computing the index of the nth element of an arithmetic sequence, given its first two elements. That is, of course, a fine approach.

Besides the actual question: if you have so much data that it cannot fit in RAM, you can set up something like memory-mapped files, or rely on virtual memory backed by a swap file on Linux.
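As a sketch of the memory-mapped approach with NumPy (the filename times.bin and the synthetic data are hypothetical; your real file would already exist on disk):

```python
import numpy as np

# Write a large array to disk once, purely for this demonstration.
data = np.arange(1_000_000, dtype=np.float64)
data.tofile("times.bin")

# Map the file instead of loading it: elements are paged in from disk
# on demand, so random indexing stays cheap and RAM usage stays small.
times = np.memmap("times.bin", dtype=np.float64, mode="r")
print(times[500_000])  # reads only the needed page of the file
```

Combined with the O(1) index estimate from the question, this means each lookup touches only a handful of elements, so the array never needs to be resident in memory.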

