Speed up pairwise comparison of two strings

I have a list of two strings, each containing a sequence and some spaces. I need to walk through both strings character by character and count the positions where neither string has a space.

I have this working, but it's too slow for my needs. Is there any way to speed it up?

from itertools import izip

def overlap(sequence_pair):
    return sum(nucleotide1 != ' ' and nucleotide2 != ' ' for nucleotide1, nucleotide2 in izip(*sequence_pair))

if __name__ == '__main__':
    sequence_pair = ['   AT GT ',
                     ' GTAGCG  ']
    print overlap(sequence_pair)
1 answer

It's hard to optimize your code in pure Python, but if you store the sequences as NumPy arrays from the very beginning, instead of Python lists or strings, you can get a significant speedup:

>>> import numpy as np
>>> sequence_pair = ['   AT GT '*10000, ' GTAGCG  '*10000]
>>> sequence_pair_arr = np.array([list('   AT GT '*10000), list(' GTAGCG  '*10000)])
>>> %timeit overlap(sequence_pair)
100 loops, best of 3: 14 ms per loop
>>> %timeit np.all(sequence_pair_arr != ' ', axis=0).sum()
100 loops, best of 3: 2.2 ms per loop