How to speed up this Python code?

I have the following tiny Python method, which is by far the performance hotspot (according to my profiler, 95% of the execution time is spent here) in a much larger program:

    def topScore(self, seq):
        ret = -1e9999
        logProbs = self.logProbs  # save indirection
        l = len(logProbs)
        for i in xrange(len(seq) - l + 1):
            score = 0.0
            for j in xrange(l):
                score += logProbs[j][seq[j + i]]
            ret = max(ret, score)
        return ret

The code runs on Jython, the JVM implementation of Python, not CPython, if that matters. seq is a DNA sequence string on the order of 1,000 elements. logProbs is a list of dictionaries, one for each position. The goal is to find the maximum score of any length-l subsequence of seq, where l is on the order of 10-20 elements.

I understand that all this looping is inefficient because of interpretation overhead and would be much faster in a statically compiled or JIT-compiled language. However, I do not want to switch languages. First, I need a JVM language for the libraries I use, and that limits my options. Second, I do not want to translate this code wholesale into a lower-level JVM language. Nevertheless, I am willing to rewrite this hotspot in something else if necessary, although I do not know how to interface it or what the overhead would be.

In addition to the single-threaded slowness of this method, I also cannot get the program to scale much beyond 4 CPUs when parallelizing. Given that it spends almost all of its time in the 10-line hotspot I posted, I can't figure out what the bottleneck could be.

+6
8 answers

If topScore is called multiple times for the same seq, you could memoize its value.

For example: http://code.activestate.com/recipes/52201/
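
A rough sketch of what that memoization could look like, not the linked recipe itself. The `Scorer` class and `cachedTopScore` name are made up for illustration, and the cache is only valid on the assumption that logProbs never changes after construction and that seq is hashable (a str or tuple). It uses `range` so it runs on current Python; under Jython 2.x `xrange` would be the usual choice.

```python
class Scorer(object):
    """Hypothetical stand-in for the class that owns topScore."""

    def __init__(self, logProbs):
        self.logProbs = logProbs
        self._cache = {}  # seq -> best score; assumes logProbs never changes

    def topScore(self, seq):
        # Same sliding-window scoring as in the question.
        logProbs = self.logProbs
        l = len(logProbs)
        ret = float("-inf")
        for i in range(len(seq) - l + 1):
            score = 0.0
            for j in range(l):
                score += logProbs[j][seq[j + i]]
            if score > ret:
                ret = score
        return ret

    def cachedTopScore(self, seq):
        # seq must be hashable (a str or tuple, not a list).
        try:
            return self._cache[seq]
        except KeyError:
            result = self._cache[seq] = self.topScore(seq)
            return result
```

This only pays off if the same sequences really do come back; for 1,000-element strings that are all distinct, the cache just costs memory.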

+2

The reason it is slow is that it is O(N * N).

A maximum subsequence algorithm may help you improve this.

+2

I don't know what I'm doing, but maybe this can speed up your algorithm:

    # needs "import collections" at module level
    ret = -1e9999
    logProbs = self.logProbs  # save indirection
    l = len(logProbs)
    scores = collections.defaultdict(int)
    for j in xrange(l):
        prob = logProbs[j]
        for i in xrange(len(seq) - l + 1):
            scores[i] += prob[seq[j + i]]
    # every window total is complete only after both loops finish
    ret = max(ret, max(scores.values()))
+1

What about precalculating xrange(l) outside the for i loop?

+1

Nothing jumps out as being slow. I might rewrite the inner loop like this:

 score = sum(logProbs[j][seq[j+i]] for j in xrange(l)) 

or even:

    seqmatch = zip(seq[i:i+l], logProbs)
    score = sum(posscores[base] for base, posscores in seqmatch)

but I don't know that either would save much time.

It might be a little faster to store the DNA bases as integers 0-3 and look up the scores from a tuple instead of a dictionary. There will be a performance hit for translating letters to numbers, but that only has to be done once.
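
Something along these lines, as a sketch only. The A/C/G/T to 0-3 mapping and the helper names (`BASE_CODES`, `encode`, `to_tuples`, `top_score`) are my own assumptions, not anything from the question, and whether the tuple lookup actually beats the dict lookup on Jython would need measuring.

```python
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed alphabet and ordering


def encode(seq):
    """Translate a DNA string to a list of integer codes (done once per seq)."""
    return [BASE_CODES[base] for base in seq]


def to_tuples(logProbs):
    """Turn each per-position dict into a tuple indexed by base code."""
    return [tuple(pos[base] for base in "ACGT") for pos in logProbs]


def top_score(table, codes):
    """Best window score, indexing tuples by integer instead of hashing."""
    l = len(table)
    best = float("-inf")
    for i in range(len(codes) - l + 1):
        score = 0.0
        for j in range(l):
            score += table[j][codes[i + j]]
        if score > best:
            best = score
    return best
```

You would call `to_tuples` once for the model and `encode` once per sequence, then reuse both for every call to `top_score`.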

0

Definitely use numpy and store logProbs as a 2D array instead of a list of dictionaries. Also store seq as a 1D array of (short) integers, as suggested above. This helps most if you do not have to do these conversions every time you call the function (doing the conversions inside the function will not save you much). You can then eliminate the second loop:

    import numpy as np
    ...
    print np.shape(self.logProbs)  # (20, 4)
    print np.shape(seq)            # (1000,)
    ...
    def topScore(self, seq):
        ret = -1e9999
        logProbs = self.logProbs  # save indirection
        l = len(logProbs)
        # rows pairs position j with base seq[i + j]; a plain ':' in its
        # place would select an l-by-l block and sum all of it
        rows = np.arange(l)
        for i in xrange(len(seq) - l + 1):
            score = np.sum(logProbs[rows, seq[i:i+l]])
            ret = max(ret, score)
        return ret

What you do after that depends on which of these two data items changes most often:

If logProbs usually stays the same and you want to run many DNA sequences through it, then consider stacking your DNA sequences as a 2D array. numpy can loop over a 2D array very quickly, so processing 200 DNA sequences will take only a little longer than processing one.
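
A minimal sketch of the stacked version, assuming all sequences have the same length and are already encoded as integers 0-3. The function name `top_scores` and the array layout are my own choices for illustration. It loops over the l positions (10-20 iterations) rather than over the windows, and lets numpy handle every window of every sequence at once.

```python
import numpy as np


def top_scores(log_probs, seqs):
    """Best window score for each row of seqs.

    log_probs: float array of shape (l, 4)
    seqs:      integer array of shape (n_seqs, n), base codes 0-3
    """
    l = log_probs.shape[0]
    n_windows = seqs.shape[1] - l + 1
    totals = np.zeros((seqs.shape[0], n_windows))
    for j in range(l):
        # Base at offset j of every window, for every sequence at once.
        totals += log_probs[j, seqs[:, j:j + n_windows]]
    return totals.max(axis=1)


log_probs = np.array([[0.0, 1.0, 2.0, 3.0],
                      [3.0, 2.0, 1.0, 0.0]])
seqs = np.array([[0, 1, 2, 3],
                 [3, 3, 0, 0]])
print(top_scores(log_probs, seqs))  # one best score per sequence
```

The same function works for a single sequence if you pass it as a one-row 2D array.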

Finally, if you really need to speed things up, use scipy.weave. It is a very easy way to write a few lines of fast C to speed up your loops. However, I recommend scipy > 0.8.

0

You can try hoisting more than just self.logProbs out of the loops:

    def topScore(self, seq):
        ret = -1e9999
        logProbs = self.logProbs  # save indirection
        l = len(logProbs)
        lrange = range(l)
        for i in xrange(len(seq) - l + 1):
            score = 0.0
            for j in lrange:
                score += logProbs[j][seq[j + i]]
            if score > ret: ret = score  # avoid lookup and function call
        return ret
0

I doubt this will be significant, but you can try changing:

  for j in xrange(l): score += logProbs[j][seq[j + i]] 

to

  for j,lP in enumerate(logProbs): score += lP[seq[j + i]] 

or even hoisting that enumeration outside the seq loop.
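
Roughly like this, written as a standalone function (the name `top_score` is just for illustration) so it can be tried on its own. The (j, lP) pairs are built once and reused for every window. I have not measured whether this helps on Jython.

```python
def top_score(logProbs, seq):
    pairs = list(enumerate(logProbs))  # built once, reused for every window
    best = float("-inf")
    for i in range(len(seq) - len(logProbs) + 1):
        score = 0.0
        for j, lP in pairs:
            score += lP[seq[j + i]]
        if score > best:
            best = score
    return best
```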

0
