Longest Equidistant Sequence

I have a million integers in sorted order, and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example,

1, 4, 5, 7, 8, 12 

has a subsequence

  4, 8, 12 

My naive method is greedy and just checks how far you can extend a subsequence from each point. It takes O(n²) time per point.

Is there a faster way to solve this problem?
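To make the problem concrete, here is a minimal brute-force sketch of the naive approach (my own illustration in Python 3; the function name is made up): fix a candidate step from each pair of points and extend it greedily with set lookups.

```python
def longest_equidistant(a):
    # Brute force for illustration: for each pair (a[i], a[j]), fix the
    # step a[j] - a[i] and extend the run as far as set lookups allow.
    s = set(a)
    best = min(len(a), 2)
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            step = a[j] - a[i]
            if step == 0:
                continue
            count, c = 2, a[j]
            while c + step in s:
                c += step
                count += 1
            best = max(best, count)
    return best

print(longest_equidistant([1, 4, 5, 7, 8, 12]))  # 3, e.g. 4, 8, 12
```

The answers below all improve on this O(n³) baseline in one way or another.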

Update. I will check the code provided in the answers as soon as possible (thanks). However, it is already clear that using n² memory will not work. So far there is no code that finishes on the input [random.randint(0,100000) for r in xrange(200000)] .

Timings. I tested with the following input on my 32-bit system.

a = [random.randint(0,10000) for r in xrange(20000)]
a.sort()
  • ZelluX's dynamic programming method uses 1.6 GB of RAM and takes 2 minutes 14 seconds. With pypy it takes only 9 seconds! However, it crashes with a memory error on large inputs.
  • Armin's O(nd) method took 9 seconds with pypy, but only 20 MB of RAM. Of course, it would be much worse if the range were much larger. The low memory usage meant that I could also test it with a = [random.randint(0,100000) for r in xrange(200000)], but it didn't finish in the few minutes I gave it with pypy.

To test Kluev's method, I ran

a = [random.randint(0,40000) for r in xrange(28000)]
a = list(set(a))
a.sort()

to make a list of length approximately 20000. All timings are with pypy:

  • ZelluX, 9 seconds
  • Kluev, 20 seconds
  • Armin, 52 seconds

It seems that if ZelluX's method could be made linear in space, it would be a clear winner.

+75
python algorithm
Aug 10 '13 at 7:59
10 answers

Update: the first algorithm described here is superseded by Armin Rigo's second answer, which is much simpler and more efficient. But both of those methods have one drawback: they need many hours to find the result for a million integers. So I tried two more variants (see the second half of this answer) where the range of input integers is assumed to be limited. This limitation allows much faster algorithms. I also tried to optimize Armin Rigo's code. See my benchmark results at the end.




Here is the idea of an O(N)-memory algorithm. The time complexity is O(N² log N), but it can be reduced to O(N²).

The algorithm uses the following data structures:

  • prev : an array of indices, each pointing to the previous element of a (possibly incomplete) subsequence.
  • hash : a hashmap with key = difference between consecutive pairs in a subsequence and value = two other hashmaps. For these other hashmaps: key = start/end index of the subsequence, value = pair (length of the subsequence, end/start index of the subsequence).
  • pq : a priority queue for all possible "difference" values of subsequences stored in prev and hash .

Algorithm:

  • Initialize prev with the indices i-1. Update hash and pq to register all (incomplete) subsequences found in this step and their "differences".
  • Get (and remove) the smallest "difference" from pq . Get the corresponding entry from hash and scan one of the second-level hashmaps. At this point, all subsequences with the given "difference" are complete. If the second-level hashmap contains a subsequence length better than what has been found so far, update the best result.
  • In the prev array: for each element of any sequence found in step 2, decrement the index and update hash and, possibly, pq . While updating hash we may perform one of the following operations: add a new subsequence of length 1, grow some existing subsequence by 1, or merge two existing subsequences.
  • Delete the hashmap entry found in step 2.
  • Continue from step 2 while pq is not empty.

This algorithm updates O(N) elements of prev , O(N) times each. And each of these updates may require adding a new "difference" to pq . All this means O(N² log N) time complexity if we use a simple heap implementation for pq . To reduce it to O(N²), we could use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues .

See the corresponding Python code on Ideone . This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond the duplicates separately).

And the same code after a little optimization . Here the search is terminated as soon as the length of the subsequence multiplied by the possible "difference" of the subsequence exceeds the range of the source list.




Armin Rigo's code is simple and quite efficient. But in some cases it does some extra computation that may be avoided. The search may be terminated as soon as the length of the subsequence multiplied by the possible "difference" of the subsequence exceeds the range of the source list:

def findLESS(A):
    Aset = set(A)
    lmax = 2
    d = 1
    minStep = 0

    while (lmax - 1) * minStep <= A[-1] - A[0]:
        minStep = A[-1] - A[0] + 1
        for j, b in enumerate(A):
            if j + d < len(A):
                a = A[j + d]
                step = a - b
                minStep = min(minStep, step)
                if a + step in Aset and b - step not in Aset:
                    c = a + step
                    count = 3
                    while c + step in Aset:
                        c += step
                        count += 1
                    if count > lmax:
                        lmax = count
        d += 1

    return lmax

print(findLESS([1, 4, 5, 7, 8, 12]))



If the range of integers in the source data (M) is small, a simple algorithm is possible with O(M²) time and O(M) space:

def findLESS(src):
    r = [False for i in range(src[-1]+1)]
    for x in src:
        r[x] = True

    d = 1
    best = 1

    while best * d < len(r):
        for s in range(d):
            l = 0
            for i in range(s, len(r), d):
                if r[i]:
                    l += 1
                    best = max(best, l)
                else:
                    l = 0
        d += 1

    return best

print(findLESS([1, 4, 5, 7, 8, 12]))

It is similar to Armin Rigo's first method, but it does not use any dynamic data structures. I suppose the source data has no duplicates. And (to keep the code simple) I also suppose that the minimum input value is non-negative and close to zero.




The previous algorithm may be improved if, instead of an array of booleans, we use a bitset data structure and bitwise operations to process data in parallel. The code below implements the bitset as a built-in Python integer. It has the same assumptions: no duplicates, and the minimum input value is non-negative and close to zero. Time complexity is O(M² * log L), where L is the length of the optimal subsequence; space complexity is O(M):

def findLESS(src):
    r = 0
    for x in src:
        r |= 1 << x

    d = 1
    best = 1

    while best * d < src[-1] + 1:
        c = best
        rr = r

        while c & (c-1):
            cc = c & -c
            rr &= rr >> (cc * d)
            c &= c-1

        while c != 1:
            c = c >> 1
            rr &= rr >> (c * d)

        rr &= rr >> d
        while rr:
            rr &= rr >> d
            best += 1

        d += 1

    return best
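To see why the bitwise version works, note that rr &= rr >> d keeps bit i set only if bits i and i + d were both set, so k repetitions leave bit i set only if the whole run i, i + d, ..., i + k*d is present. A tiny standalone demo of this trick (my own, with a made-up step-4 example):

```python
# Demo of the shift-and-AND run extension used in the bitset algorithm.
r = 0
for x in [1, 4, 8, 12]:      # 4, 8, 12 is an arithmetic run with step 4
    r |= 1 << x

d = 4
rr = r
length = 1
while True:
    rr &= rr >> d            # bit i stays set iff bits i and i+d were set
    if not rr:
        break
    length += 1

print(length)  # 3: the longest step-4 run has 3 elements
```

Each pass costs one big-integer shift and AND, which is how the bitset packs an O(M) scan into a few machine-word operations.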



Benchmarks:

Input data (about 100000 integers) was generated this way:

random.seed(42)
s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))

And for the fastest algorithms I also used the following data (about 1000000 integers):

 s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)]))) 

All results show time in seconds:

Size:                          100000   1000000
Second answer by Armin Rigo:      634         ?
By Armin Rigo, optimized:          64     >5000
O(M^2) algorithm:                  53      2940
O(M^2*L) algorithm:                 7       711
+11
Aug 11 '13 at

We can have an O(n*m) solution in time with very little memory needed, by adapting yours. Here n is the number of elements in the given input sequence of numbers, and m is the range, i.e. the largest number minus the smallest one.

Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we are looking for (the difference between two consecutive numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward for the length of the sequence starting at n with step d. Then mark all items in that sequence as already seen, so that we avoid looking for them again, for the same d. Because of this, the complexity is just O(n) for every value of d.

A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

for d in range(1, 12):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            print "found %d items in %d .. %d" % (count, a, b)
            # collect here the largest 'count'

Update:

  • This solution may be good enough if you are only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximate, but actually runnable for n = 1000000. (Measured: 400-500 seconds with CPython, 80-90 seconds with PyPy, over a random subset of numbers between 0 and 10'000'000.)

  • If you still want to search the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
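That stopping rule can be sketched as follows (a Python 3 wrapper of my own, not Armin's code; it assumes the input is sorted, non-empty, and without duplicates): once best * d exceeds A[-1] - A[0], no sequence of best + 1 elements with step d can fit in the range, so the scan over d may stop.

```python
def longest_with_step_bound(A):
    # Scan steps d = 1, 2, ... and stop once a (best + 1)-element
    # run with step d could no longer fit: it would need
    # best * d <= A[-1] - A[0].
    Aset = set(A)
    best = 1
    span = A[-1] - A[0]
    d = 1
    while best * d <= span:
        already_seen = set()
        for a in A:
            if a in already_seen:
                continue
            b, count = a, 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            best = max(best, count)
        d += 1
    return best

print(longest_with_step_bound([1, 4, 5, 7, 8, 12]))  # 3
```

With long sequences present, best grows quickly and the while condition cuts off most of the d range.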

+19
Aug 10 '13 at 10:40

UPDATE: I found a paper on this problem; you can download it here .

Here is a solution based on dynamic programming. It requires O(n^2) time and O(n^2) space, and does not use hashing.

Suppose that all numbers are stored in array a in ascending order, and n saves its length. The 2D array l[i][j] defines the length of the longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).

lmax = 2
l = [[2 for i in xrange(n)] for j in xrange(n)]

for mid in xrange(n - 1):
    prev = mid - 1
    succ = mid + 1
    while (prev >= 0 and succ < n):
        if a[prev] + a[succ] < a[mid] * 2:
            succ += 1
        elif a[prev] + a[succ] > a[mid] * 2:
            prev -= 1
        else:
            l[mid][succ] = l[prev][mid] + 1
            lmax = max(lmax, l[mid][succ])
            prev -= 1
            succ += 1

print lmax
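For testing, the snippet can be wrapped into a self-contained function (Python 3; the wrapper and name are mine, and it assumes a is sorted in ascending order):

```python
def longest_dp(a):
    # For each middle index, sweep (prev, succ) pairs that satisfy
    # a[prev] + a[succ] == 2 * a[mid], i.e. three equally spaced values.
    n = len(a)
    lmax = min(n, 2)
    l = [[2] * n for _ in range(n)]
    for mid in range(n - 1):
        prev, succ = mid - 1, mid + 1
        while prev >= 0 and succ < n:
            if a[prev] + a[succ] < a[mid] * 2:
                succ += 1
            elif a[prev] + a[succ] > a[mid] * 2:
                prev -= 1
            else:
                l[mid][succ] = l[prev][mid] + 1
                lmax = max(lmax, l[mid][succ])
                prev -= 1
                succ += 1
    return lmax

print(longest_dp([1, 4, 5, 7, 8, 12]))  # 3
```

The two-pointer sweep around each middle element is what keeps the whole thing at O(n²) despite examining triples.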
+11
Aug 11 '13 at 16:34

Algorithm

  • The main loop traverses the list
  • If the number is found in the precalc dict, then it belongs to all the sequences in that dict; recalculate all of those sequences with count + 1
  • Delete everything precalculated for the current element.
  • Recalculate new sequences, where the first element ranges from 0 up to the current one and the second is the current element of the traversal (actually, not from 0 up to the current one; we can use the fact that a new element should not be bigger than max(a) and that a new list should be able to grow longer than the one already found)

So, for the list [1, 2, 4, 5, 7] the output would be (it's a bit messy; try the code yourself and see):

  • index 0, element 1 :
    • if 1 in precalc? No - do nothing.
    • Nothing to do
  • index 1, element 2 :
    • if 2 in precalc? No - do nothing.
    • check if 3 = 1 + ( 2 - 1 ) * 2 is in our set? No - do nothing.
  • index 2, element 4 :
    • if 4 in precalc? No - do nothing
      • check if 6 = 2 + ( 4 - 2 ) * 2 is in our set? No
      • check if 7 = 1 + ( 4 - 1 ) * 2 is in our set? Yes: add a new entry {7: {3: {'count': 2, 'start': 1}}}, where 7 is the list element and 3 is the step.
  • index 3, element 5 :
    • if 5 in precalc? No - do nothing
      • do not check 4, because 6 = 4 + ( 5 - 4 ) * 2 is less than the already-computed element 7
      • check if 8 = 2 + ( 5 - 2 ) * 2 is in our set? No
      • check 9 = 1 + ( 5 - 1 ) * 2: greater than max(a) == 7
  • index 4, element 7 :
    • if 7 in precalc? Yes - put it into the result
      • do not check 5, because 9 = 5 + ( 7 - 5 ) * 2 is greater than max(a) == 7

result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into a sequence
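The last step ("turn it into a sequence") is just element k = start + step * k, which matches the list comprehension returned by the code below. A quick check on the triple above:

```python
# Reconstructing the sequence from the result (step, {'count', 'start'}).
step, info = 3, {'count': 3, 'start': 1}
seq = [info['start'] + step * k for k in range(info['count'])]
print(seq)  # [1, 4, 7]: the longest equidistant run in [1, 2, 4, 5, 7]
```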

Complexity

It should not be more than O(N^2), and I think it is less because of the early termination of the search for new sequences; I will try to analyze it in more detail later.

The code:

def add_precalc(precalc, start, step, count, res, N):
    if step == 0:
        return True
    if start + step * res[1]["count"] > N:
        return False

    x = start + step * count
    if x > N or x < 0:
        return False

    if precalc[x] is None:
        return True

    if step not in precalc[x]:
        precalc[x][step] = {"start": start, "count": count}

    return True

def work(a):
    precalc = [None] * (max(a) + 1)
    for x in a:
        precalc[x] = {}

    N, m = max(a), 0
    ind = {x: i for i, x in enumerate(a)}

    res = (0, {"start": 0, "count": 0})

    for i, x in enumerate(a):
        for el in precalc[x].iteritems():
            el[1]["count"] += 1
            if el[1]["count"] > res[1]["count"]:
                res = el
            add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
            t = el[1]["start"] + el[0] * el[1]["count"]
            if t in ind and ind[t] > m:
                m = ind[t]
        precalc[x] = None

        for y in a[i - m - 1::-1]:
            if not add_precalc(precalc, y, x - y, 2, res, N):
                break

    return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]
+3
Aug 10 '13 at 9:28

Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.

The idea is quite naive: like the original poster, it is greedy and just checks how far you can extend a subsequence from each pair of points, but it checks first that we are at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.

The trick is to convince yourself that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That is left as an exercise for the reader :-) The time here is competitive with other O(n^2) solutions.

A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

lmax = 2
for j, b in enumerate(A):
    for i in range(j):
        a = A[i]
        step = b - a
        if b + step in Aset and a - step not in Aset:
            c = b + step
            count = 3
            while c + step in Aset:
                c += step
                count += 1
            #print "found %d items in %d .. %d" % (count, a, c)
            if count > lmax:
                lmax = count

print lmax
+3
Aug 15 '13 at 6:25

Your solution is O(N^3) now (you said O(N^2) per index). Here is an O(N^2) time and O(N^2) memory solution.

Idea

If we know a subsequence going through indices i[0], i[1], i[2], i[3], then we should not try a subsequence starting with i[1] and i[2], or with i[2] and i[3].

Note: I edited this code to make it a bit easier by sorting a, but it will not work for equal elements. You may check the maximum number of equal elements in O(N) easily.

Pseudocode

I am looking for the maximum length only, but that doesn't change anything.

whereInA = {}
for i in range(n):
    whereInA[a[i]] = i  # It doesn't matter which of same elements it points to

boolean usedPairs[n][n]

for i in range(n):
    for j in range(i + 1, n):
        if usedPair[i][j]:
            continue  # do not do anything. It was in one of prev sequences.
        usedPair[i][j] = true

        # here quite stupid solution:
        diff = a[j] - a[i]
        if diff == 0:
            continue  # we can't work with that

        lastIndex = j
        currentLen = 2
        while whereInA contains index a[lastIndex] + diff:
            nextIndex = whereInA[a[lastIndex] + diff]
            usedPair[lastIndex][nextIndex] = true
            ++currentLen
            lastIndex = nextIndex

        # you may store all indices here
        maxLen = max(maxLen, currentLen)
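A runnable Python 3 rendering of the pseudocode above (my own translation; it assumes no duplicate elements and tracks only the maximum length):

```python
def longest_used_pairs(a):
    # Skip pairs already visited inside an earlier chain, so every
    # (i, j) pair is extended at most once: O(N^2) time and memory.
    n = len(a)
    where_in_a = {x: i for i, x in enumerate(a)}
    used_pair = [[False] * n for _ in range(n)]
    max_len = min(n, 2)
    for i in range(n):
        for j in range(i + 1, n):
            if used_pair[i][j]:
                continue
            used_pair[i][j] = True
            diff = a[j] - a[i]
            if diff == 0:
                continue  # cannot happen without duplicates
            last, current_len = j, 2
            while a[last] + diff in where_in_a:
                nxt = where_in_a[a[last] + diff]
                used_pair[last][nxt] = True
                current_len += 1
                last = nxt
            max_len = max(max_len, current_len)
    return max_len

print(longest_used_pairs([1, 4, 5, 7, 8, 12]))  # 3
```

Every chain marks all the pairs it passes through, which is exactly what bounds the total work by the number of pairs.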

Thoughts on memory usage

O(N^2) time is very slow for 1,000,000 elements. But if you are going to run this code on that many elements, the biggest problem will be memory usage.
What can be done to reduce it?

  • Change the boolean arrays to bitfields to save memory.
  • Make every next boolean array shorter, because we only use usedPairs[i][j] if i < j

A bit of heuristics:

  • Keep only the pairs actually used. (Conflicts with the first idea)
  • Remove usedPairs that will never be used anymore (for those i , j that were already chosen in the loop)
+2
Aug 10 '13 at 9:20

This is my 2 cents.

If you have a list called input:

 input = [1, 4, 5, 7, 8, 12] 

You can build a data structure that, for each of these points (except the first), tells you how far that point is from each of its predecessors:

[1, 4, 5, 7, 8, 12]
 x  3  4  6  7  11    # distance from point i to point 0
 x  x  1  3  4   8    # distance from point i to point 1
 x  x  x  2  3   7    # distance from point i to point 2
 x  x  x  x  1   5    # distance from point i to point 3
 x  x  x  x  x   4    # distance from point i to point 4

Now that you have the columns, you can consider the i-th element of the input (which is input[i] ) and each number n in its column.

The numbers that belong to a series of equidistant numbers including input[i] are those which have n * j at the i-th position of their column, where j is the number of matches already found while moving through the columns from left to right, plus the k-th predecessor of input[i] , where k is the index of n in the column of input[i] .

Example: if we consider i = 1 , input[i] = 4 , n = 3 , then we can identify a sequence comprising 4 ( input[i] ), 7 (since it has a 3 at position 1 of its column) and 1 , since k is 0, so we take the first predecessor of input[i] .

A possible implementation (sorry if the code does not use the same notation as the explanation):

def build_columns(l):
    columns = {}
    for x in l[1:]:
        col = []
        for y in l[:l.index(x)]:
            col.append(x - y)
        columns[x] = col
    return columns

def algo(input, columns):
    seqs = []
    for index1, number in enumerate(input[1:]):
        index1 += 1  # first item was sliced
        for index2, distance in enumerate(columns[number]):
            seq = []
            seq.append(input[index2])  # k-th pred
            seq.append(number)
            matches = 1
            for successor in input[index1 + 1:]:
                column = columns[successor]
                if column[index1] == distance * matches:
                    matches += 1
                    seq.append(successor)
            if len(seq) > 2:
                seqs.append(seq)
    return seqs

The longest:

seqs = algo(input, build_columns(input))
print max(seqs, key=len)
+1
Aug 10 '13 at 21:46

Walk through the array, keeping a record of the optimal result(s) and a table with

(1) index: the difference between elements in a sequence,
(2) count: the number of elements in the sequence so far, and
(3) the last recorded element.

For each element of the array, look at its difference from each previous element of the array; if that element is the last in a sequence indexed in the table, adjust that sequence in the table and update the best sequence if applicable; otherwise start a new sequence, provided a sequence longer than the current max is still possible.

Scanning backwards, we can stop the scan when d is greater than half the array's range; or when the current max is greater than the possible sequence length, for d greater than the largest indexed difference. Sequences whose last recorded element is smaller than s[j] are deleted.

I converted my JavaScript code to Python (my first Python code):

import random
import timeit
import sys

#s = [1,4,5,7,8,12]
#s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
#s = [0, 6, 7, 10, 11, 12, 16, 18, 19]

m = [random.randint(1,40000) for r in xrange(20000)]
s = list(set(m))
s.sort()

lenS = len(s)
halfRange = (s[lenS-1] - s[0]) // 2

while s[lenS-1] - s[lenS-2] > halfRange:
    s.pop()
    lenS -= 1
    halfRange = (s[lenS-1] - s[0]) // 2

while s[1] - s[0] > halfRange:
    s.pop(0)
    lenS -= 1
    halfRange = (s[lenS-1] - s[0]) // 2

n = lenS

largest = (s[n-1] - s[0]) // 2
#largest = 1000 #set the maximum size of d searched

maxS = s[n-1]
maxD = 0
maxSeq = 0
hCount = [None]*(largest + 1)
hLast = [None]*(largest + 1)
best = {}

start = timeit.default_timer()

for i in range(1,n):
    sys.stdout.write(repr(i)+"\r")
    for j in range(i-1,-1,-1):
        d = s[i] - s[j]
        numLeft = n - i
        if d != 0:
            maxPossible = (maxS - s[i]) // d + 2
        else:
            maxPossible = numLeft + 2
        ok = numLeft + 2 > maxSeq and maxPossible > maxSeq

        if d > largest or (d > maxD and not ok):
            break

        if hLast[d] != None:
            found = False
            for k in range(len(hLast[d])-1,-1,-1):
                tmpLast = hLast[d][k]
                if tmpLast == j:
                    found = True
                    hLast[d][k] = i
                    hCount[d][k] += 1
                    tmpCount = hCount[d][k]
                    if tmpCount > maxSeq:
                        maxSeq = tmpCount
                        best = {'len': tmpCount, 'd': d, 'last': i}
                elif s[tmpLast] < s[j]:
                    del hLast[d][k]
                    del hCount[d][k]
            if not found and ok:
                hLast[d].append(i)
                hCount[d].append(2)
        elif ok:
            if d > maxD:
                maxD = d
            hLast[d] = [i]
            hCount[d] = [2]

end = timeit.default_timer()
seconds = (end - start)
#print (hCount)
#print (hLast)
print(best)
print(seconds)
0
10 . '13 15:59

, : , K = 1 . , O (N ^ 2). C, , 3 , N = 20000 M = 28000 32- .

0
20 . '13 15:27



0
Feb 25 '17 at 4:12


