How to quickly find first a few list items from a list of large integers?

I have a list of approximately 100,000 sorted even integers ranging from 10 ^ 12 to 10 ^ 14. My goal is to find the first number x in the list, so x*2 also a member of the list. Since my list is quite long, speed is very important.

My first thought was to simply iterate over the list and check if each element was also there when it was multiplied by 2, but after its implementation it is clear that it is too slow for my purposes.

My next thought was to decompose each item in the list into its simple decomposition using SymPy factorint , and then search my expanded list for the same decomposition, with the exception of extra 2. This didn't turn out to be faster obvious, but I feel that there should be a way to use simple decomposition, if not something else.

+6
source share
5 answers

You can iterate over two iterators in your list: one points to the current element and one points to the first one, higher or equal to it, double. It will be O (N).

Here is a draft idea:

 l = [1, 3, 5, 7, 10, 12, 15] # ... j = 0 for i in range(0, len(l)): while l[j] < 2*l[i]: j += 1 if j == len(l): return -1 if l[j] == 2*l[i]: return i 

Edit: After comments in another answer, several performance tests show that this version will be much faster (3 times in my tests), eliminating multiplications, calls len and decreasing the number in the list:

 j = 0 s = len(l) for i in range(0, s): l_i = l[i] l_i2 = l_i<<1 while l[j] < l_i2: j += 1 if j == s: return -1 if l[j] == l_i2: return i 
+7
source

Colin Pitrat's solution is very good, but I think you can use it when using sortedcontainers.SortedSet

I created a file with 1,000,000 random numbers (using numpy.random.randint) and checked that there is at least one number (in the middle of the file) that satisfies your condition.

Compare both approaches:

 import filecmp from sortedcontainers import SortedSet from timeit import Timer def load2list(): with open('data', 'r') as f: return [int(line) for line in f] def save_result(i, fn): with open(fn, 'a') as f: print(i, file=f) def test_Colin_Pitrat(): j = 0 for i in range(0, len(l)): while l[j] < 2*l[i]: j += 1 if j == len(l): return -1 if l[j] == 2*l[i]: save_result(l[i], 'Colin_Pitrat.out') return l[i] def test_SortedSet(): for i in sorted_set: if i<<1 in sorted_set: save_result(i, 'SortedSet.out') return i return -1 if __name__=='__main__': timeit_number = 10000 l = load2list() sorted_set = SortedSet(l) print('len(l):\t\t%d' % (len(l))) print('len(sorted_set):\t%d' % (len(sorted_set))) print('Timeit results with %d executions:' %timeit_number) print('Colin_Pitrat:\t', Timer(test_Colin_Pitrat).timeit(timeit_number)) print('SortedSet:\t', Timer(test_SortedSet).timeit(timeit_number)) print("filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'):\t%s" % (filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'))) 

Output:

 len(l): 1000001 len(sorted_set): 999504 Timeit results with 10000 executions: Colin_Pitrat: 35.94529931032006 SortedSet: 2.548847197918647 filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'): True 

PS, as you can see, SortedSet is very fast.

UPDATE: (now I am testing at home, where my computer is much slower , so I will reduce the number of executions to 1.000)

As Colin Pitrat said, I am now generating data (about 100,000 numbers) for the worst case scenario - when no match is found. Now I will compare three functions: test_Colin_Pitrat, test_Colin_Pitrat2 (customized version), test_SortedSet ...

Script data generator:

 import numpy as np l = np.random.randint(10**7, 10**9, 200000) l = l[ l % 2 > 0 ] np.savetxt('data', np.sort(l), fmt='%d') 

code:

 import filecmp from sortedcontainers import SortedSet from timeit import Timer def load2list(): with open('data', 'r') as f: return [int(line) for line in f] def save_result(i, fn): with open(fn, 'a') as f: print(i, file=f) def test_Colin_Pitrat(): j = 0 for i in range(0, len(l)): while l[j] < 2*l[i]: j += 1 if j == len(l): return -1 if l[j] == 2*l[i]: return l[i] def test_Colin_Pitrat2(): j = 0 s = len(l) for i in range(0, s): l_i = l[i] l_i2 = l_i<<1 while l[j] < l_i2: j += 1 if j == s: return -1 if l[j] == l_i2: return l[j] def test_SortedSet(): for i in sorted_set: if i<<1 in sorted_set: return i return -1 if __name__=='__main__': timeit_number = 1000 l = load2list() sorted_set = SortedSet(l) print('len(l):\t\t%d' % (len(l))) print('len(sorted_set):\t%d' % (len(sorted_set))) print('Timeit results for %d executions:' %timeit_number) print('Colin_Pitrat:\t', Timer(test_Colin_Pitrat).timeit(timeit_number)) print('Colin_Pitrat2:\t', Timer(test_Colin_Pitrat2).timeit(timeit_number)) print('SortedSet:\t', Timer(test_SortedSet).timeit(timeit_number)) 

Output:

 len(l): 99909 len(sorted_set): 99899 Timeit results for 1000 executions: Colin_Pitrat: 153.04753258882357 Colin_Pitrat2: 103.68264272815443 SortedSet: 99.59669211136577 

Conclusion: Colin_Pitrat2 is 33% faster compared to Colin_Pitrat and almost as fast as SortedSet.

+3
source

I will add another answer because I already overloaded my previous ...

Now I came up with a new idea - the SortedSets intersection, which works extremely fast compared to the modified Colin and my previous solution. The idea is to create two SortedSets:

l2 is the intersection of two lists / sets: the original l and the one that contains all the elements of l multiplied by 2: SortedSet(x*2 for x in l) .

l1 is a SortedSet containing all elements belonging to l2 divided by 2: SortedSet(x//2 for x in l2)

code:

 from datetime import datetime as dt from sortedcontainers import SortedSet from timeit import Timer def load2list(): with open('data', 'r') as f: return [int(line) for line in f] def test_Colin_Pitrat2(): j = 0 s = len(l) for i in range(0, s): l_i = l[i] l_i2 = l_i<<1 while l[j] < l_i2: j += 1 if j == s: return -1 if l[j] == l_i2: return l[i] def test_SortedSet(): for i in sorted_set: if i<<1 in sorted_set: return i return -1 def test_SetIntersection(): for i in l1: if i*2 in l2: return i return -1 if __name__=='__main__': start_ts = dt.now() timeit_number = 10000 l = load2list() print('load2list() took:\t%d microseconds' %((dt.now() - start_ts).microseconds)) start_ts = dt.now() sorted_set = SortedSet(l) l2 = SortedSet(l).intersection(SortedSet(x*2 for x in l)) l1 = SortedSet(x//2 for x in l2) print('preparing sets took:\t%d microseconds' %((dt.now() - start_ts).microseconds)) print('len(l):\t\t%d' % (len(l))) print('len(l1):\t%d' % (len(l1))) print('len(l2):\t%d' % (len(l2))) print('len(sorted_set):\t%d' % (len(sorted_set))) print('Timeit results for %d executions:' %timeit_number) print('Colin_Pitrat2:\t\t', Timer(test_Colin_Pitrat2).timeit(timeit_number)) print('SortedSet:\t\t', Timer(test_SortedSet).timeit(timeit_number)) print('SetIntersection:\t', Timer(test_SetIntersection).timeit(timeit_number)) 

Output:

 load2list() took: 230023 microseconds preparing sets took: 58106 microseconds len(l): 498786 len(l1): 256 len(l2): 256 len(sorted_set): 498562 Timeit results for 10000 executions: Colin_Pitrat2: 23.557948959065648 SortedSet: 6.658937808213555 SetIntersection: 0.012540539222982261 

PS I really like this question because it gave me a chance to learn something new. And I also like Colin's algorithm - he's smart. I have already supported this.

+1
source

You can split your list in half so that the bottom list contains all the numbers from the first in the source list to the last number, which is equal to or less than half the maximum value in your list. This will speed up your search as you will iterate over only half of the list and search if x * 2 is on the other half. See example below.

 l = [1, 3, 5, 7, 10, 12, 15] max_l = l[-1] for i in range(len(l)): if l[i] > max_l/2: break pos_mid_l = i print pos_mid_l # returns position 3 lower_l = l[:pos_mid_l+1] upper_l = l[pos_mid_l+1:] print lower_l # returns lower half of your list print upper_l # returns upper half of your list for i in lower_l: if i < upper_l[0]/2: if i*2 in lower_l: ans = i break else: if i*2 in upper_l: ans = i break print ans # Your answer :) 
0
source

It seems to me that the best algorithm would use a pre-sorted list and set.

 #!/usr/local/pypy3-2.4.0/bin/pypy3 import time import random def get_random_numbers(): result = [] for counter in range(99998): random_number = (random.randrange(10**12, 10**14) // 2) * 2 result.append(random_number) # Make sure there at least one solution to the problem moderately_large = 10**14 // 2 result.append(moderately_large) result.append(moderately_large * 2) # This sort is of course O(n*logn), but I don't think you're # counting that, are you? result.sort() return result def get_best(list_): # This is O(n) set_ = set(list_) for value in list_: if value * 2 in set_: return value return None def main(): list_ = get_random_numbers() time0 = time.time() result = get_best(list_) time1 = time.time() print(result) print('{} seconds'.format(time1 - time0)) main() 

NTN

0
source

All Articles