Convert ASCII encoding to int and back in Python (fast)

I have a file format (FASTQ) that encodes a sequence of integers as a string, where each integer is represented by the ASCII character whose code is the integer plus an offset. Unfortunately, there are two encodings in common use: one with an offset of 33, and the other with an offset of 64. I usually have several hundred million lines of length 80-150 to convert from one offset to the other. The simplest code I could come up with for this is:

    def phred64ToStdqual(qualin):
        return ''.join([chr(ord(x) - 31) for x in qualin])
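To make the arithmetic concrete: a quality score q is stored as chr(q + 64) in one encoding and chr(q + 33) in the other, so converting means shifting every character down by 64 - 33 = 31. A small Python 3 sketch of that round trip follows; the helper names decode_quals and encode_quals are made up for illustration and are not part of any FASTQ library.

```python
# Illustrative only: decode_quals / encode_quals are hypothetical helper names.
PHRED64_OFFSET = 64
PHRED33_OFFSET = 33

def decode_quals(qual, offset):
    """Turn a quality string into a list of integer scores."""
    return [ord(c) - offset for c in qual]

def encode_quals(scores, offset):
    """Turn a list of integer scores back into a quality string."""
    return ''.join(chr(s + offset) for s in scores)

scores = decode_quals('hhdd', PHRED64_OFFSET)      # [40, 40, 36, 36]
converted = encode_quals(scores, PHRED33_OFFSET)   # 'IIEE'
```

Going through the integer list is only for clarity; subtracting 31 from each character code directly gives the same string.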

This works fine, but it is not particularly fast. For 1 million lines, it takes about 4 seconds on my machine. If I switch to a pair of dictionaries for the translation, I can do it in about 2 seconds.

    # Lookup tables: character -> integer code, and integer code -> character.
    ctoi = {}
    itoc = {}
    for i in xrange(127):
        itoc[i] = chr(i)
        ctoi[chr(i)] = i

    def phred64ToStdqual2(qualin):
        return ''.join([itoc[ctoi[x] - 31] for x in qualin])

If I compile the same code unchanged with Cython, it runs in less than 1 second.
At the C level this looks like nothing more than a cast to int, a subtraction, and a cast back to char. I have not written that version, but I would guess it is faster still. Any hints, including how best to code this in Python, or even a Cython version, would be very helpful.
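The cast, subtract, cast-back idea can be approximated without leaving Python if the quality line is handled as bytes rather than text. This is only a sketch, assuming Python 3 (where iterating over a bytes object yields integers); the function name phred64_to_stdqual_bytes is hypothetical and no timings are claimed for it.

```python
def phred64_to_stdqual_bytes(qualin):
    """Shift every byte of a Phred+64 quality line down by 31."""
    # Iterating over bytes yields ints, so no ord()/chr() calls are needed.
    # A byte below 31 would give a negative value and raise ValueError.
    return bytes(b - 31 for b in qualin)

line = b'hhhhdddd'
result = phred64_to_stdqual_bytes(line)  # b'IIIIEEEE'
```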

Thanks,

Sean

1 answer

If you look at the source of urllib.quote, it does something similar to what you are doing. Adapted to your case, it looks like this:

    _map = {}

    def phred64ToStdqual2(qualin):
        # Fill the lookup table lazily on the first call.
        if not _map:
            for i in range(31, 127):
                _map[chr(i)] = chr(i - 31)
        return ''.join(map(_map.__getitem__, qualin))

Note that the function above also works when the mapped values do not have the same length as the keys (in urllib.quote, for example, "%" has to become "%25").
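A toy illustration of that point, assuming Python 3: the table below maps every ASCII character to itself except "%", which grows to three characters. The names _escape and escape_percent are invented for this example, and this is not urllib's actual implementation.

```python
# Hypothetical table: most characters map to themselves, '%' grows to '%25'.
_escape = {chr(i): chr(i) for i in range(128)}
_escape['%'] = '%25'

def escape_percent(text):
    """Replace each character via the table; the output may be longer."""
    return ''.join(map(_escape.__getitem__, text))

escaped = escape_percent('100%')  # '100%25'
```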

But in fact, since each character is translated to exactly one character, Python has functions that do this very quickly: maketrans and translate. You probably won't get much faster than:

    import string

    _trans = None

    def phred64ToStdqual4(qualin):
        global _trans
        # Build the translation table once: chr(i) -> chr(i - 31).
        if not _trans:
            _trans = string.maketrans(''.join(chr(i) for i in range(31, 127)),
                                      ''.join(chr(i) for i in range(127 - 31)))
        return qualin.translate(_trans)
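string.maketrans exists only in Python 2. Under Python 3 the same idea can be written with bytes.maketrans and bytes.translate, reading the file in binary mode. A sketch under that assumption; the names _TRANS and phred64_to_stdqual are illustrative:

```python
# Build the translation table once at import time: byte i maps to byte i - 31.
_TRANS = bytes.maketrans(bytes(range(31, 127)), bytes(range(127 - 31)))

def phred64_to_stdqual(qualin):
    """Convert a Phred+64 quality line (bytes) to Phred+33."""
    return qualin.translate(_TRANS)

converted = phred64_to_stdqual(b'hhhhdddd')  # b'IIIIEEEE'
```

For str input, str.maketrans called with two equal-length strings builds an equivalent table for str.translate.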