Filtering a subset of (potentially) 1,000,000+ elements

I have a large dataset , possibly over a million records. All elements have an assigned time stamp, and the elements are added to the set at runtime (usually, but not always, with a newer time stamp). I need to show a subset of this data with a specific time range. This time range is usually quite small compared to the general data set, that is, out of 1,000,000+ elements not exceeding about 1,000 in a given given time interval. This time range moves at a constant pace, for example. every second the time range moves by one second. In addition, the user can adjust the time range at any time ("move" through the data set) or set additional filters (for example, a filter by text).

So far, I have not worried about performance, trying to understand that everything is in order, and worked only with smaller test suites. I am not quite sure how to solve this problem effectively and I will be happy for every contribution. Thanks.

Edit: The language used is C # 4.

Update: now I use the interval tree, an implementation can be found here: https://github.com/mbuchetics/RangeTree

It also comes with an asynchronous version that restores the tree using a parallel task library (TPL).

+4
source share
3 answers

We had a similar problem in our development - we had to collect several million items sorted by some key, and then export one page on demand upon request. I see that your problem is somehow similar.

For this purpose, we adapted the structure of red-black wood in the following ways:

  • we added an iterator to it so that we can get the β€œnext” element in o (1)
  • we added an iterator search from the "index" and were able to do this in O (log n)

The RB> tree has an O (log n) input complexity, so I think your inserts will fit in well with it.

next() on the iterator was implemented by adding and maintaining a linked list of all leaf nodes - our original accepted implementation of the RB tree implemented "t enable this."

The RB> tree is also cool, because it allows you to fine-tune the size of the node to suit your needs. By experimenting, you will be able to determine the correct numbers that simply correspond to your problem.

+3
source

Use a SortedList sorted by timestamp.

All you need to do is implement a binary search in the sorted keys inside the sorted list to find the border of your choice, which is pretty simple.

+1
source

Insert new items into the sorted list. This will allow you to easily select a range. You can also use linq if you are familiar with it.

0
source

All Articles