I have a large sequence of tuples on a disk in the form (t1, k1) (t2, k2) ... (tn, kn)
ti is a monotonically increasing timestamp, and ki is the key (if necessary, accept a string with a fixed length). Neither tee nor ki are guaranteed to be unique. However, the number of unique tis and kis is huge (millions). n is very large (100 million +), and the size of k (about 500 bytes) makes it impossible to store everything in memory.
I would like to know the periodic occurrences of keys in this sequence.
For example, if I have the sequence (1, a) (2, b) (3, c) (4, b) (5, a) (6, b) (7, d) (8, b) (9, a) (10, b)
The algorithm should emit (a, 4) and (b, 2). This occurs with period 4 and b occurs with period 2.
If I create a hash of all keys and save the average value of the difference between successive timestamps of each key and its deviation std, I could skip and report only those that have acceptable std deviation (ideally, 0). However, each unique key requires one bucket, while in practice I may have very few truly periodic patterns. Any better ways?
algorithm
Miner
source share