I have the following tiny Python method, which is by far the performance hotspot (according to my profiling, ~95% of execution time is spent here) in a much larger program:
def topScore(self, seq):
    ret = -1e9999
    logProbs = self.logProbs  # save indirection
    l = len(logProbs)
    for i in xrange(len(seq) - l + 1):
        score = 0.0
        for j in xrange(l):
            score += logProbs[j][seq[j + i]]
        ret = max(ret, score)
    return ret
The code runs under Jython, not CPython, if that matters. seq is a DNA sequence on the order of 1000 elements. logProbs is a list of dictionaries, one per position. The goal is to find the maximum score over every length-l window (l is on the order of 10-20 elements) of the subsequence seq.
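One low-effort change worth trying before anything more drastic is collapsing the inner loop into a single builtin `sum()` call, which moves the per-element dispatch out of interpreted bytecode. This is a hedged sketch, not the original method: it is written as a standalone function (no `self`) and uses Python 3's `range` for portability; under Jython you would keep `xrange`.

```python
def top_score(seq, logProbs):
    # Standalone sketch of the hotspot with the inner loop replaced
    # by sum() over a generator expression. The per-position dict
    # lookup is unchanged; only the loop bookkeeping moves out of
    # interpreted bytecode.
    best = float("-inf")
    n_windows = len(seq) - len(logProbs) + 1
    for i in range(n_windows):
        # enumerate(logProbs) pairs each position's dict with its
        # offset j inside the current window starting at i.
        score = sum(lp[seq[i + j]] for j, lp in enumerate(logProbs))
        if score > best:
            best = score
    return best
```

Whether this wins on Jython depends on how well its JIT handles generator frames, so it needs to be measured rather than assumed.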
I understand that this loop is slow because of interpretation overhead and would be much faster in a statically compiled or JIT-compiled language. However, I don't want to switch languages. First, I need a JVM language for the libraries I use, and that constrains my options. Second, I don't want to translate the bulk of this code into a lower-level JVM language. That said, I'd be willing to rewrite just this hotspot in something else if necessary, although I have no idea how to interface it or what the overhead would be.
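Before reaching for another language, the data representation itself can be made cheaper while staying in pure Python. Since the alphabet here is presumably just the four DNA bases (an assumption not stated explicitly in the question), the per-position dicts can be converted once into small lists indexed by an integer code, so the inner loop does a list index instead of a string hash per element. A hedged sketch, again as a standalone function in Python 3 syntax:

```python
def top_score_indexed(seq, logProbs):
    # Assumption: the alphabet is exactly the four DNA bases.
    base_code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    # One-time conversion: each per-position dict becomes a 4-element
    # list, so lookups inside the hot loop are integer indexing.
    tables = [[d['A'], d['C'], d['G'], d['T']] for d in logProbs]
    # Encode the sequence to integer codes once, outside the loops.
    codes = [base_code[c] for c in seq]
    l = len(tables)
    best = float("-inf")
    for i in range(len(codes) - l + 1):
        score = 0.0
        for j in range(l):
            score += tables[j][codes[i + j]]
        if score > best:
            best = score
    return best
```

The conversion cost is paid once per call (or could be hoisted out entirely if logProbs is reused), while the savings are paid back on every one of the roughly 1000 × 20 inner iterations.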
Besides the single-threaded slowness of this method, I also can't get the program to scale to much more than 4 processors when parallelizing. Given that it spends nearly all of its time in the 10-line hotspot I posted, I can't figure out what the bottleneck could be.
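Since Jython has no GIL, the window starting positions are independent and can in principle be split across real threads. The sketch below is one hypothetical way to partition the hotspot (strided by thread index so the chunks are balanced); it is illustrative only and says nothing about why scaling currently stops near 4 cores, which would have to be diagnosed separately (e.g. allocation/GC pressure from the boxed floats in the inner loop is a plausible suspect).

```python
import threading

def top_score_parallel(seq, logProbs, n_threads=4):
    # Hypothetical parallel split of the hotspot. Each thread scans a
    # strided subset of the window start positions and records its own
    # local maximum; the final answer is the max over threads.
    l = len(logProbs)
    n_windows = len(seq) - l + 1
    results = [float("-inf")] * n_threads

    def worker(t):
        best = float("-inf")
        # Thread t handles starts t, t + n_threads, t + 2*n_threads, ...
        for i in range(t, n_windows, n_threads):
            score = 0.0
            for j in range(l):
                score += logProbs[j][seq[i + j]]
            if score > best:
                best = score
        results[t] = best  # one writer per slot, so no lock needed

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return max(results)
```

On CPython this buys nothing because of the GIL, but on Jython the threads map to JVM threads and can run concurrently.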
dsimcha