Detecting Periodic Patterns in a Large Dataset

I have a large sequence of tuples on a disk in the form (t1, k1) (t2, k2) ... (tn, kn)

ti is a monotonically increasing timestamp, and ki is the key (assume a fixed-length string if necessary). Neither ti nor ki is guaranteed to be unique, but the number of unique tis and kis is huge (millions). n is very large (100 million +), and the size of each k (about 500 bytes) makes it impossible to store everything in memory.

I would like to know the periodic occurrences of keys in this sequence.

For example, if I have the sequence (1, a) (2, b) (3, c) (4, b) (5, a) (6, b) (7, d) (8, b) (9, a) (10, b)

The algorithm should emit (a, 4) and (b, 2): a occurs with period 4 and b occurs with period 2.

If I build a hash over all keys and keep, for each key, the mean of the differences between successive timestamps and their standard deviation, I could make a single pass and report only the keys with an acceptable standard deviation (ideally 0). However, this requires one bucket per unique key, while in practice I may have very few truly periodic patterns. Any better ways?
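
For concreteness, here is a rough sketch of the single pass I have in mind (Python; the function name and tolerance are placeholders, and it still keeps one bucket per key, which is the part I'd like to avoid):

```python
import math

def periodic_keys(pairs, tol=1e-9):
    """One pass over (timestamp, key) pairs ordered by timestamp.
    Keeps a running mean/variance of the inter-arrival gaps per key
    (Welford's update) and reports keys whose gap std is within tol."""
    stats = {}  # key -> [gap_count, last_ts, gap_mean, M2]
    for t, k in pairs:
        if k not in stats:
            stats[k] = [0, t, 0.0, 0.0]
            continue
        n, last, mean, m2 = stats[k]
        gap = t - last
        n += 1
        delta = gap - mean
        mean += delta / n
        m2 += delta * (gap - mean)
        stats[k] = [n, t, mean, m2]
    for k, (n, _, mean, m2) in stats.items():
        if n >= 2 and math.sqrt(m2 / n) <= tol:
            yield k, mean

seq = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'b'), (5, 'a'),
       (6, 'b'), (7, 'd'), (8, 'b'), (9, 'a'), (10, 'b')]
print(sorted(periodic_keys(seq)))   # [('a', 4.0), ('b', 2.0)]
```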

+7
algorithm
5 answers

You can use discrete autocorrelation to find the periods and then search for the keys. The advantage of autocorrelation is that it is a little easier to understand what is happening in the discrete domain, and you don't have to worry about mapping the keys to anything: just use the characteristic function of two keys, which is 1 when they are equal and 0 when they are not.
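
A small sketch of the idea (NumPy's np.correlate; positions in the sequence stand in for the timestamps, so this assumes roughly regular sampling):

```python
import numpy as np

def key_period(keys, key):
    """Discrete autocorrelation of the 0/1 characteristic sequence of `key`;
    the lag with the strongest self-overlap is the candidate period."""
    x = np.array([1.0 if k == key else 0.0 for k in keys])
    r = np.correlate(x, x, mode='full')[len(x) - 1:]  # non-negative lags only
    return int(np.argmax(r[1:]) + 1) if len(r) > 1 else None

keys = ['a', 'b', 'c', 'b', 'a', 'b', 'd', 'b', 'a', 'b']
print(key_period(keys, 'b'))   # 2
print(key_period(keys, 'a'))   # 4
```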

+4

This is more or less what Fourier Transforms (Fast Fourier Transforms, etc.) are for.

You essentially transform the sequence from the time domain (or some similar dimension) into the frequency domain. This is a very old problem that predates computers, and there is a huge amount of theory on the subject. Also see the discrete Fourier transform.

EDIT: you would need to convert your k1, k2, ... values to numbers somehow, but assuming this is possible, this approach should work as well.
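
For illustration, one possible conversion is the 0/1 presence series of a single key; a rough NumPy sketch (the conversion choice and variable names are mine):

```python
import numpy as np

keys = ['a', 'b', 'c', 'b', 'a', 'b', 'd', 'b', 'a', 'b']
signal = np.array([k == 'b' for k in keys], dtype=float)
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))  # drop the DC offset
peak_bin = int(np.argmax(spectrum[1:]) + 1)             # strongest frequency bin
print(len(signal) / peak_bin)  # 2.0 -> 'b' repeats every 2 positions
# Caveat: a 0/1 spike train also has energy at harmonics of its fundamental,
# so on longer windows you would look at the lowest strong bin, not just the max.
```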

+2

If I create a hash of all keys and store the mean of the differences between successive timestamps of each key and their standard deviation, I can make a pass and report only those that have an acceptable standard deviation (ideally 0). However, this requires one bucket per unique key, whereas in practice I may have very few truly periodic patterns. Any better ways?

Personally, I think this is probably the best you are going to get unless you can pin down more structure in the problem.

0

Denote the (timestamp, string) tuple as (key, value). Some assumptions:

1. The set of values is discrete, i.e. the match between periodic occurrences of a value is exact: aaabb ... aaabb, not aaabb ... aaabc.
2. The set of all instances of a single value fits in memory.

Algorithm:

1. Get the complete list of all unique values.
2. For each unique value, collect its tuples, building an ordered list of timestamps.
3. Apply a pattern-search algorithm to this data; ideally a non-uniform discrete Fourier transform or autocorrelation (a small sketch follows).
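
A rough sketch of the steps under assumption 2 (an in-memory dict for illustration; at 100M+ tuples you would group with an external sort by value instead). The exact-match assumption lets step 3 be as simple as a constant-gap check on the timestamps themselves, which also handles non-uniform sampling:

```python
from collections import defaultdict

def group_by_value(pairs):
    """Steps 1-2: an ordered timestamp list per unique value
    (pairs are assumed to be ordered by timestamp already)."""
    by_value = defaultdict(list)
    for t, v in pairs:
        by_value[v].append(t)
    return by_value

def constant_gap_period(times):
    """Step 3, simplest form: under the exact-match assumption a periodic
    value has a single distinct gap between successive timestamps."""
    gaps = {b - a for a, b in zip(times, times[1:])}
    return gaps.pop() if len(gaps) == 1 else None

pairs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'b'), (5, 'a'),
         (6, 'b'), (7, 'd'), (8, 'b'), (9, 'a'), (10, 'b')]
for v, times in group_by_value(pairs).items():
    period = constant_gap_period(times)
    if period is not None:
        print(v, period)   # a 4, then b 2
```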

0

You really have two separate problems:

  • you have m different signals in your data, defined by the m unique keys. You need to separate the signals and store each one separately.

  • Given one of these separated signals, you must determine whether it is periodic; this is an application of autocorrelation or the discrete Fourier transform, whichever you prefer. For example, the DFT gives you the coefficients of the periodic functions that interpolate your data. If only one coefficient in the DFT is nonzero, there is a clear period.

If you apply the DFT or autocorrelation to the data without separating the signals first, you get a harder problem: you won't know whether a detected "period" comes from a single unique signal or from a mixture of several. A sketch of the separate-then-test pipeline follows.
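
A rough end-to-end sketch of the two steps (separate first, then apply the DFT to each signal on its own). The "lowest strong bin" rule and the thresholds are my own heuristics: a 0/1 occurrence train carries energy on its fundamental and on its harmonics, so the period comes from the lowest strong coefficient rather than from a single nonzero one:

```python
import numpy as np
from collections import defaultdict

def separated_periods(pairs, min_peak=0.5, min_hits=3):
    """Step 1: split the data into one 0/1 series per key.
    Step 2: DFT each series separately and read the period off the
    lowest strong frequency bin."""
    n = len(pairs)
    positions = defaultdict(list)
    for i, (_, key) in enumerate(pairs):
        positions[key].append(i)
    for key, idx in positions.items():
        if len(idx) < min_hits:
            continue
        x = np.zeros(n)
        x[idx] = 1.0
        spectrum = np.abs(np.fft.rfft(x - x.mean()))
        strong = np.nonzero(spectrum >= min_peak * spectrum.max())[0]
        strong = strong[strong > 0]          # ignore the DC bin
        if strong.size:
            yield key, n / strong[0]

# 20-step toy window: 'a' every 4 steps, 'b' every 2 steps, filler elsewhere
keys = ['a', 'b', 'f', 'b', 'a', 'b', 'g', 'b', 'a', 'b',
        'h', 'b', 'a', 'b', 'f', 'b', 'a', 'b', 'g', 'b']
pairs = list(enumerate(keys, start=1))
print(dict(separated_periods(pairs)))   # {'a': 4.0, 'b': 2.0}
```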

0
