What is a faster way to find a value in a list of tuples?

Question

What is a faster way to find a value in a list of tuples?

I am looking for a country by ip range for tens of millions of lines. I am looking for a faster way to search.

I have 180K tuples in this form:

>>> data = ((0, 16777215, 'ZZ'), ... (1000013824, 1000079359, 'CN'), ... (1000079360, 1000210431, 'JP'), ... (1000210432, 1000341503, 'JP'), ... (1000341504, 1000603647, 'IN'))

(Integers are IP addresses converted to primes.)

This works correctly, but takes too much time:

 >>> ip_to_lookup = 999 >>> country_result = [country ... for (from, to, country) in data ... if (ip_to_lookup >= from) and ... (ip_to_lookup <= to)][0] >>> print country_result ZZ

Can someone point me in the right direction for a faster way to do this search? Using the method above, 100 searches take 3 seconds. Sense, I think 10M lines will take several days.

+8

python sorting search tuples

exzackley Mar 17 '12 at 17:47

source share

2 answers

Assuming your situation meets some requirements, there is a way to complicate the average O(1) execution time, but the complexity of the space suffers.

Data must be static; All data must be processed before any searches.
For an arbitrary IP address, there must be a way to define meaningful octets .
There should be enough space to add a key for each meaningful value for each country.

Below is a very naive implementation. It selects the first two IP octets as significant, no matter what, then combines the significant octets as integers and gradually adds a key for each value between mininum and maximum. As you can probably tell, there are many opportunities for improvement.

 from socket import inet_ntoa from struct import pack data = ((0, 16777215, 'ZZ'), (1000013824, 1000079359, 'CN'), (1000079360, 1000210431, 'JP'), (1000210432, 1000341503, 'JP'), (1000341504, 1000603647, 'IN')) def makedict(minip, maxip, country): d = {} for i in xrange(key(minip), key(maxip)+1): d[i] = country return d def key(ip): octets = inet_ntoa(pack('>L', ip)).split('.') return int("".join(octets[0:2])); lookup = {} for lo, hi, cnt in data: lookup.update(makedict(lo, hi, cnt)) print lookup[key(999)] # ZZ print lookup[key(1000215555)] # JP

+1

Matt eckert Mar 17 '12 at 20:59

source share

Niklas B. · Accepted Answer · 2012-03-17T18:01:24+0000

You can use the bisect module to perform a binary search after sorting a dataset:

 from operator import itemgetter import bisect data = ((0, 16777215, 'ZZ'), (1000013824, 1000079359, 'CN'), (1000079360, 1000210431, 'JP'), (1000210432, 1000341503, 'JP'), (1000341504, 1000603647, 'IN')) sorted_data = sorted(data, key=itemgetter(0)) lower_bounds = [lower for lower,_,_ in data] def lookup_ip(ip): index = bisect.bisect(lower_bounds, ip) - 1 if index < 0: return None _, upper, country = sorted_data[index] return country if ip <= upper else None print(lookup_ip(-1)) # => None print(lookup_ip(999)) # => 'ZZ' print(lookup_ip(16777216)) # => None print(lookup_ip(1000013824)) # => 'CN' print(lookup_ip(1000400000)) # => 'IN'

The algorithmic complexity of searching is O(log n) here, not O(n) for a complete list view.

What is a faster way to find a value in a list of tuples?

More articles: