I have a file format (FASTQ) that encodes a sequence of integers as a string, where each integer is represented by an ASCII character code plus an offset. Unfortunately, two encodings are in common use: one with an offset of 33 and the other with an offset of 64. I typically have several hundred million lines, each 80-150 characters long, to convert from one offset to the other. The simplest code I could come up with for this is:
    def phred64ToStdqual(qualin):
        return(''.join([chr(ord(x)-31) for x in qualin]))
This works fine, but it is not particularly fast. For 1 million lines, it takes about 4 seconds on my machine. If I switch to a pair of dictionaries for the translation, I can do it in about 2 seconds.
    ctoi = {}
    itoc = {}
    for i in xrange(127):
        itoc[i] = chr(i)
        ctoi[chr(i)] = i

    def phred64ToStdqual2(qualin):
        return(''.join([itoc[ctoi[x]-31] for x in qualin]))
If I blindly compile the same code with Cython, without any changes, it runs in less than 1 second.
At the C level, this seems to amount to casting each character to an int, subtracting, and casting back to a char. I have not written that version, but I guess it would be a little faster. Any hints, including how best to code this in Python, or even a Cython version of it, would be very helpful.
Thanks,
Sean
seandavi