Find a period in the long run of a periodic sequence

Short description.

I have a sequence of numbers [0, 1, 4, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7, 0, 0, 1, 1, 2, 3, 7] . As you can see, from the 3rd value, the sequence is periodic with the period [0, 0, 1, 1, 2, 3, 7] .

I am trying to automatically extract this period from this sequence. The problem is that I do not know the length of the period and do not know from what position the sequence becomes periodic.

Full explanation (some math may be required)

I study combinatorial game theory, and the cornerstone of this theory is to calculate the Grundy values ​​of a game chart. This leads to an infinite sequence, which in many cases becomes ultimately periodic .

I found a way to efficiently calculate grundy values ​​(it returns me the sequence). I would like to automatically extract the offset and period of this sequence. I know that when you see part of the sequence [1, 2, 3, 1, 2, 3] , you cannot be sure that [1, 2, 3] is a period (who knows, maybe the next number is 4 , which violates the assumption), but I'm not interested in such subtleties (I assume that the sequence is enough to find the real period). Also the problem is that the sequence may stop in the middle of the period: [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, ...] (the period is still 1, 2, 3 ).

I also need to find the smallest offset and period. For example, for the original sequence, the offset can be [0, 1, 4, 0, 0] and the period [1, 1, 2, 3, 7, 0, 0] , but the smallest is [0, 1, 4] and [0, 0, 1, 1, 2, 3, 7] .


My ineffective approach is to try all possible offsets and all possible periods. Build a sequence using this data and check if it matches the original. I did not do any normal analysis, but it looks like it is at least quadratic in terms of time complexity.

Here is my quick python code (did not check it correctly):

 def getPeriod(arr): min_offset, min_period, n = len(arr), len(arr), len(arr) best_offset, best_period = [], [] for offset in xrange(n): start = arr[:offset] for period_len in xrange(1, (n - offset) / 2): period = arr[offset: offset+period_len] attempt = (start + period * (n / period_len + 1))[:n] if attempt == arr: if period_len < min_period: best_offset, best_period = start[::], period[::] min_offset, min_period = len(start), period_len elif period_len == min_period and len(start) < min_offset: best_offset, best_period = start[::], period[::] min_offset, min_period = len(start), period_len return best_offset, best_period 

Which returns me what I want for my original sequence:

 offset [0, 1, 4] period [0, 0, 1, 1, 2, 3, 7] 

Is there something more effective?

+7
math algorithm sequence
source share
3 answers
  • I would start by plotting the histogram of values ​​in a sequence

    So, you just make a list of all the numbers used in the sequence (or a significant part) and consider their appearance. This is O(n) , where n is the size of the sequence.

  • sort histogram ascending

    This is O(m.log(m)) , where m is the number of different values. You can also ignore unlikely numbers ( count<treshold ), which are likely to be offset or only irregularities, which are then omitted by m . For periodic sequences, m <<< n so that you can use it as the first marker if the sequence is periodic or not.

  • find out the period

    In the histogram , counts should be approximately a multiple of n/period . So, an approximate / find GCD from among the histograms. The problem is that you need to consider that there are irregularities in the calculations, as well as in n (the offset part), so you need to calculate the GCD . eg:

     sequence = { 1,1,2,3,3,1,2,3,3,1,2,3,3 } 

    has an ordered histogram:

     item,count 2 3 1 4 3 6 

    the GCD(6,4)=2 and GCD(6,3)=3 you should check at least +/-1 around the GCD results so that the possible periods are around:

     T = ~n/2 = 13/2 = 6 T = ~n/3 = 13/3 = 4 

    So check T={3,4,5,6,7} to be sure. Always use a GCD between the highest number of samples and the minimum values. If the sequence has many different numbers, you can also make a histogram of counters that check only the most common values.

    To check the expiration date, simply take any element closer to the end or middle of the sequence (just use the probable periodic region). Then look for it in a close area near the likely period before (or after) its occurrence. If you find several times, you have received the correct period (or several)

  • Get the exact period

    Just check the found fragments of the period ( T/2, T/3, ...) or run a histogram for the found period, and the smallest count will tell you how many real periods you received encapsulated, so divide them by.

  • find offset

    When you know the period, it is easy. Just scan from the beginning, take the first element and see if there is a period after that. If I do not remember the position. Stop at the end or in the middle of a sequence ... or on some subsequent successes. This is before O(n) And the last memorized position is the last element in offset .

[edit1] It was curious, so I'm trying to encode it in C ++

I simplified / skipped a few things (assuming at least half of the array is periodic) to check if I made some stupid mistake in my algorithm and here is the result (it works as expected):

 const int p=10; // min periods for testing const int n=500; // generated sequence size int seq[n]; // generated sequence int offset,period; // generated properties int i,j,k,e,t0,T; int hval[n],hcnt[n],hs; // histogram // generate periodic sequence Randomize(); offset=Random(n/5); period=5+Random(n/5); for (i=0;i<offset+period;i++) seq[i]=Random(n); for (i=offset,j=i+period;j<n;i++,j++) seq[j]=seq[i]; if ((offset)&&(seq[offset-1]==seq[offset-1+period])) seq[offset-1]++; // compute histogram O(n) on last half of it for (hs=0,i=n>>1;i<n;i++) { for (e=seq[i],j=0;j<hs;j++) if (hval[j]==e) { hcnt[j]++; j=-1; break; } if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; } } // bubble sort histogram asc O(m^2) for (e=1,j=hs;e;j--) for (e=0,i=1;i<j;i++) if (hcnt[i-1]>hcnt[i]) { e=hval[i-1]; hval[i-1]=hval[i]; hval[i]=e; e=hcnt[i-1]; hcnt[i-1]=hcnt[i]; hcnt[i]=e; e=1; } // test possible periods for (j=0;j<hs;j++) if ((!j)||(hcnt[j]!=hcnt[j-1])) // distinct counts only if (hcnt[j]>1) // more then 1 occurence for (T=(n>>1)/(hcnt[j]+1);T<=(n>>1)/(hcnt[j]-1);T++) { for (i=n-1,e=seq[i],i-=T,k=0;(i>=(n>>1))&&(k<p)&&(e==seq[i]);i-=T,k++); if ((k>=p)||(i<n>>1)) { j=hs; break; } } // compute histogram O(T) on last multiple of period for (hs=0,i=nT;i<n;i++) { for (e=seq[i],j=0;j<hs;j++) if (hval[j]==e) { hcnt[j]++; j=-1; break; } if (j>=0) { hval[hs]=e; hcnt[hs]=1; hs++; } } // least count is the period multiple O(m) for (e=hcnt[0],i=0;i<hs;i++) if (e>hcnt[i]) e=hcnt[i]; if (e) T/=e; // check/handle error if (T!=period) { return; } // search offset size O(n) for (t0=-1,i=0;i<nT;i++) if (seq[i]!=seq[i+T]) t0=i; t0++; // check/handle error if (t0!=offset) { return; } 

The code is still not optimized. For n=10000 it takes about 5ms on my setup. The result is t0 (offset) and T (period). You may have to play a bit with treshold constants

+4
source share

Note : if there is a period P1 with length L, then there is also a period P2 with the same length , L, so that the input sequence ends exactly with P2 (i.e. we do not have a partial period involved at the end).

Indeed, a different period of the same length can always be obtained by changing the offset. The new period will be the turn of the initial period.

For example, the following sequence has a period of length 4 and an offset of 3:

0 0 0 (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2 3 4) (1 2

but also has a period with the same length of 4 and an offset of 5 without a partial period at the end:

0 0 0 1 2 (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2) (3 4 1 2)


The implication is that we can find the minimum period length by processing the sequence in reverse order and look for the minimum period using zero offset from the end. One possible approach is to simply use your current algorithm in the reverse list, without requiring a loop for offsets.

Now that we know the length of the required period, we can also find its minimum offset. One possible approach is to try all the different offsets (with the advantage that a length loop is not required, since the length is known), but further optimizations are possible if necessary. promoting as much as possible when processing the list from the end, allowing you to make the final repetition of the period (that is, closest to the beginning without the reverse sequence) partial.

+5
source share

I needed to do something similar once. I used brute force and common sense, the solution is not very elegant, but it works. The solution always works, but you must set the correct parameters (k, j, con) in the function.

  • The sequence is stored as a list in the seq variable.
  • k is the size of the sequence array, if you think your sequence will take a long time to become periodic, then set this k to a large number.
  • The variable found will tell us whether the array passed the periodic test with period j
  • j is the period.
  • If you expect a long period, you must set j a large number.
  • We check the frequency by checking the last digits of j + 30 sequences.
  • The longer the period ( j ), the more we must check.
  • As soon as one of the tests passes, we will exit the function and return a shorter period.

As you can see, the accuracy depends on the variables j and k , but if you set them to very large numbers, this will always be correct.

 def some_sequence(s0, a, b, m): try: seq=[s0] snext=s0 findseq=True k=0 while findseq: snext= (a*snext+b)%m seq.append(snext) #UNTIL THIS PART IS JUST TO CREATE THE SEQUENCE (seq) SO IS NOT IMPORTANT k=k+1 if k>20000: # I IS OUR LIST INDEX for i in range(1,len(seq)): for j in range(1,1000): found =True for con in range(j+30): #THE TRICK IS TO START FROM BEHIND if not (seq[-i-con]==seq[-ij-con]): found = False if found: minT=j findseq=False return minT except: return None 

simplified version

 def get_min_period(sequence,max_period,test_numb): seq=sequence if max_period+test_numb > len(sequence): print("max_period+test_numb cannot be bigger than the seq length") return 1 for i in range(1,len(seq)): for j in range(1,max_period): found =True for con in range(j+test_numb): if not (seq[-i-con]==seq[-ij-con]): found = False if found: minT=j return minT 

Where max_period is the maximum period you want to find, and test_numb is how many numbers of the sequence you want to test, the more, but it is better to do max_period + test_numb <LEN (sequence)

0
source share

All Articles