Given two lists of integers, find each pair within the distance from each other <O (N ^ 2)

I have two sorted lists of integers. I would like to find all pairs of integers from the first and second list, respectively, which are at a certain distance from each other.

The naive approach is to test each pair, resulting in O (N ^ 2) time. I am sure there is a way to do this in O (N * logN) or perhaps shorter.

In python, the naive approach of O (N ^ 2) is as follows:

def find_items_within(list1, list2, within): for l1 in list1: for l2 in list2: if abs(l1 - l2) <= within: yield (l1, l2) 

Extra points for pythonic answers.

Application Note

I just wanted to point out the purpose of this little riddle. I am looking for a document and want to find all occurrences of one term at a certain distance from another term. First you will find the term vectors of both terms, then you can use the algorithms described below to find out if they are at a given distance from each other.

+6
source share
7 answers

This code is O (n * log (n) + m), where m is the size of the response.

 def find_items_within(l1, l2, dist): l1.sort() l2.sort() b = 0 e = 0 ans = [] for a in l1: while b < len(l2) and a - l2[b] > dist: b += 1 while e < len(l2) and l2[e] - a <= dist: e += 1 ans.extend([(a,x) for x in l2[b:e]]) return ans 

In the worst case, it is possible that m = n*n , but if the answer is only a small subset of all possible pairs, it is much faster.

+5
source

It is impossible to do this better than O(n^2) , because there are O(n^2) pairs, and for within = infinity you need to provide all of them.


To find the number of these pairs is another story, and this can be done by finding the first index for each element e , which is enough within-e < arr[idx] . The idx index can be found efficiently using binary search, for example, which will help you find an O(nlogn) solution for finding the number of these pairs.

This can also be done in linear time ( O(n) ), since you do not need to perform a binary search for all elements after the first range [a,b] , note that for each other range [a',b'] - if a>a' then b>=b' - so you really need to iterate over the lists with two pointers and never look back to get linear time complexity.

pseudo-code: (for solving linear time)

 numPairs <- 0 i <- 0 a <- 0 b <- 0 while (i < list1.length): while (a < i && list1[i] - list2[a] > within): a <- a+1 while (b < list2.length && list2[b] - list1[i] < within): b <- b+1 if (b > a): numPairs <- numPairs + (ba) i <- i+1 return numPairs 

(I made some corrections from the original pseudo-code, because the first one sought to find the number of pairs within the range in one list - and does not match between the two lists, sorry for that)

+7
source

Here is something with the same interface that you specified:

 def find_items_within(list1, list2, within): i2_idx = 0 shared = [] for i1 in list1: # pop values to small while shared and abs(shared[0] - i1) > within: shared.pop(0) # insert new values while i2_idx < len(list2) and abs(list2[i2_idx] - i1) <= within: shared.append(list2[i2_idx]) i2_idx += 1 # return result for result in zip([i1] * len(shared), shared): yield result for item in find_items_within([1,2,3,4,5,6], [3,4,5,6,7], 2): print item 

Not very pretty, but it should do the trick in O(N*M) , where N is the length of list1 and M list of common pairs per element (given that the elements omitted and added to shared are an average constant).

+2
source

It works:

 from itertools import takewhile def myslice(lst,start,stop,stride=1): stop = len(lst) if stop is None else stop for i in xrange(start,stop,stride): yield lst[i] def find_items_within(lst1,lst2,within): l2_start = 0 for l1 in lst1: try: l2_start,l2 = next( (i,x) for i,x in enumerate(myslice(lst2,l2_start,None),l2_start) if abs(l1-x) <= within ) yield l1,l2 for l2 in takewhile(lambda x:(abs(l1-x) <= within), myslice(lst2,l2_start+1,None)): yield l1,l2 except StopIteration: pass x = range(10) y = range(10) print list(find_items_within(x,y,2.5)) 
0
source

Depending on the distribution of values โ€‹โ€‹in your lists, you can speed up the process using binning: take the range in which all your values โ€‹โ€‹fall ( min(A+B), max(A+B) ), and divide this range into cells of the same same size as the distance youre considering. Then, to find all the pairs, you only need to compare the values โ€‹โ€‹in the basket or in the neighboring cells. If your values โ€‹โ€‹are shared between many cells, this is an easy way to avoid M * N comparisons.

Another way that in practice can be just as simple: do some sort of limited linear scan. Enter the index on list A and list B, starting from the beginning. At each iteration, move the pointer to list A (start with the first element), call this element A0. Then add the index to list B. Remember the last value of B, which is less than A0-D (this is where you want to start for the next iteration). But keep moving forward until you find the values โ€‹โ€‹between A0-D and A0 + D - these are the pairs you are looking for. As soon as the values โ€‹โ€‹in B become greater than A0 + D, stop this iteration and start the next one: move one element further to B and start scanning B from the last place where B was <A0-D.

If you have an average constant number of neighboring pairs per element, I think it should be O (M + N)?

0
source

This method uses a dictionary whose keys are possible list2 values, and whose values โ€‹โ€‹are a list of list1 values โ€‹โ€‹that are at a distance from this list2 value.

 def find_items_within(list1, list2, within): a = {} for l1 in list1: for i in range(l1-within, l1+within+1): if i not in a: a[i] = [] a[i].append(l1) for l2 in list2: if l2 in a: for l1 in a[l2]: yield(l1, l2) 

The difficulty for this is disgusting. for list 1 of size M and list 2 of size N and a within size W, this is O (log (M * W) * (M * W + N)). In practice, I think this works very well for small W.

Bonus : this also works with unsorted lists.

0
source

You can find integers from list2 that are in the range [x - within, x + within] for all x from list1 in linear time ( O(n) ) using the "scan line" method (see How to find all overlap intervals and Sub O (n ^ 2) for counting nested intervals? ).

To list the corresponding intervals from list1 , you need O(m) time, when m is the number of intervals, that is, the general O(n*m) algorithm:

 from collections import namedtuple from heapq import merge def find_items_within(list1, list2, within): issorted = lambda L: all(x <= y for x, y in zip(L, L[1:])) assert issorted(list1) and issorted(list2) and within >= 0 # get sorted endpoints - O(n) (due to list1, list2 are sorted) Event = namedtuple('Event', "endpoint x type") def get_events(lst, delta, type): return (Event(x + delta, x, type) for x in lst) START, POINT, END = 0, 1, 2 events = merge(get_events(list1, delta=-within, type=START), get_events(list1, delta=within, type=END), get_events(list2, delta=0, type=POINT)) # O(n * m), m - number of points in `list1` that are # within distance from given point in `list2` started = set() # started intervals for e in events: # O(n) if e.type is START: # started interval started.add(ex) # O(m) is worst case (O(1) amortized) elif e.type is END: # ended interval started.remove(ex) # O(m) is worst case (O(1) amortized) else: # found point assert e.type is POINT for x in started: # O(m) yield x, ex 

To allow duplicate values โ€‹โ€‹in list1 ; you can add an index for each x in Event and use the index -> x dictionary instead of the set started .

0
source

All Articles