Search for the number of parallel events taking into account the start and end times

I have a massive (~ 10 9 ) set of events, characterized by a start and end time. Given some time, I want to find out how many of these events were at that time.

What data structure would be useful in this situation? Operations that need to be completed quickly:

  • Insert new events, for example, {start: 100000 milliseconds, end: 100010 milliseconds} .
  • Request the number of simultaneous events at a given time.

Update: someone put a flag on computational geometry on it, so I think I should rephrase this from the point of view of computational geometry. I have a set of one-dimensional intervals, and I want to calculate how many of these intervals intersect with a given point. Inserting new intervals should be quick.

+7
source share
4 answers

You are looking for an interval tree.

  • Construction: O(n log n) , where n is the number of intervals
  • Query: O(m+log n) , where m is the number of query results, and n is the number of intervals
  • Space: O(n)
+7
source

To add to other answers, depending on the length of time and the desirability of detailing, you can simply have an array of counters. For example, if the time duration is 24 hours and the desired granularity is 1 millisecond, there will be 86,400,000 cells in the array. If you use one 4 bytes of int per cell (this is enough to hold 10 ^ 9), this will be less than 700 MB of RAM compared to tree-based solutions that would occupy at least (8 + 8 + 4 + 4) * 10 ^ 9 = 24 GB of RAM for two pointers plus two ints for the node tree (since 32 bits of address memory are insufficient, you need 64 bits per pointer). You can use swap, but this will significantly slow down some requests.

You can also use this solution if you only care about the last 24 hours of data, for example, using an array as a circular buffer. Besides the time limit and granularity, another drawback is that the interval insertion time is proportional to the interval length, so if the interval length is unlimited, you may have problems. Queries, on the other hand, are a single array search.

+2
source

(Extension of tskuzzy and Snowball answers)

A balanced binary search tree makes sense, except that the memory requirements will be excessive for your dataset. A B-tree will be much more efficient in terms of memory, albeit more complex if you cannot use the library.

Save two trees, one of the start time and one of the end time. To insert an event, add a start time to the start time and end time tree of the end time tree. To query the number of active events at time T, find the start tree to find out how many start times are less than T, and find the end time tree to see how many end times are less than T. Subtract the number of end times from the number of start times and the number of active events .

Insertions and queries must take O (log N) time.

A few comments:

  • As you formulated the question, you only care about the number of active events, and not about what events were active. This means that you don’t need to keep track of what start time goes with what end time! It also facilitates the deletion of the word β€œ+ M” in queries in previous answers.

  • Observe the exact semantics of your request. In particular, an event is considered active at time T if it starts at time T? If it ends at time T? Answers to these questions affect your use of <or <= in specific places.

  • Use a β€œset” not data structure because you almost certainly want to allow and count duplicates. That is, more than one event may begin and / or end simultaneously. A set usually ignores duplicates. Instead, you are looking for a "multiset" (sometimes called a "bag").

  • Many binary search trees do not support the "number of elements <T" queries out of the box. But this functionality is easy to add, keeping the size in each node.

+2
source

Suppose we have a sorted set (for example, a balanced binary search tree or skip list ) with N elements. Also, suppose the sorted set has O (log N) search time, O (log N) insertion time, and O (N) space usage (these are reasonable assumptions, see red-black tree , for example).

One possibility is to have two sorted sets, bystart and byend , respectively sorted by the start and end time of the events.

To find the number of events that flow at time t , ask byend for the first interval whose end time is longer than t : a search operation O (log N). Call the start time of this left interval. Now ask bystart number of intervals whose start time is greater than or equal to left and less than t . This is O (log N + M), where M is the number of such intervals. So, the total search time is O (log N + M).

The insert was O (log N) for the sorted sets, which we have to do once for each sorted set. This makes the total time for the insert operation O (log N).

The construction of this structure from scratch consists of only N insert operations, so the total time to build is O (N log N).

The use of space is O (N) for each sorted set, so the total use of space is O (N).

Summary:

  • Insert: O (log N), where N is the number of intervals
  • Construction: O (N log N)
  • Query: O (log N + M), where M is the number of results
  • Space: O (N)
0
source

All Articles