Best way to iterate over a byte / unicode string in Cython

I'm just starting out with Cython, and I also find it hard to search for Cython answers, so sorry in advance.

I'm reimplementing a Python function in Cython. The original is plain Python:

    def func(s, numbers=None):
        if numbers:
            some_dict = numbers
        else:
            some_dict = default
        return sum(some_dict[c] for c in s)

And it works fine on Python 2 and 3. But as soon as I try to type s and c, it breaks on at least one of the Python versions. I tried:

    def func(char *s, numbers=None):
        if numbers:
            some_dict = numbers
        else:
            some_dict = default
        cdef char c
        cdef double m = 0.0
        for c in s:
            m += some_dict[<bytes>c]
        return m

Honestly, this is about all I need, and it gives a decent speedup in Python 2, but it breaks in Python 3. After reading this piece of the Cython docs, I thought the following would work in Python 3:

    def func(unicode s, numbers=None):
        if numbers:
            some_dict = numbers
        else:
            some_dict = default
        cdef double m = 0.0
        for c in s:
            m += some_dict[c]
        return m

but it actually raises a KeyError, and it seems c is still a char (the missing key is 80 when s starts with 'P'), yet when I print(type(c)) it says <class 'str'>.
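To illustrate in plain Python 3 (not Cython) why the number 80 shows up as the missing key: a C char is an integer, so the typed loop looks up the byte value instead of the 1-character string the dict uses as keys. The weights dict here is a hypothetical stand-in for the question's some_dict:

```python
# Hypothetical stand-in for the question's dict of per-character weights
weights = {'P': 1.5}

print(ord('P'))        # 80 -- the "missing key" from the KeyError
print('P' in weights)  # True
print(80 in weights)   # False -- this is the lookup the char-typed loop does

# In Python 3, iterating a str yields 1-char strings,
# but iterating bytes yields ints:
print([c for c in 'Py'])   # ['P', 'y']
print([c for c in b'Py'])  # [80, 121]
```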

Note that the original untyped code works in both versions, but is about twice as slow as the working typed version in Python 2.

So, how do I get it to work on Python 3 in general, and how can I get it to work on both versions of Python at once? Can / should I wrap type declarations in type / version checks? Or can I write two functions and conditionally assign one of them to a public name?

PS I'm fine with only allowing ASCII characters in the string, if that matters, but I doubt it does, since Cython seems to prefer explicit encoding/decoding.


Edit: I also tried explicitly encoding and iterating over the byte string, which seemed sensible, but the following code:

    def func(s, numbers=None):
        if numbers:
            some_dict = numbers
        else:
            some_dict = default
        cdef double m = 0.0
        cdef bytes bs = s.encode('ascii')
        cdef char c
        for c in bs:
            m += some_dict[(<bytes>c).decode('ascii')]
        return m

is 3 times slower than my first attempt in Python 2 (close to the speed of a pure Python function) and almost 2 times slower in Python 3.
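The per-character decode and dict lookup is what costs the most here. A plain-Python sketch of the lookup-table idea (assuming ASCII-only input, as the question allows): build a 256-entry list indexed by byte value once, then index it directly while summing over the encoded bytes. The default dict here is a hypothetical example:

```python
# Hypothetical per-character weights (stand-in for the question's `default`)
default = {'P': 1.5, 'y': 0.5}

# One-time conversion: dict of 1-char keys -> 256-entry table
# indexed by byte value; unlisted bytes weigh 0.0
weights = [0.0] * 256
for ch, w in default.items():
    weights[ord(ch)] = w

def func(s, table=weights):
    # In Python 3, iterating bytes yields ints, so each byte
    # indexes the table directly -- no per-character decode
    return sum(table[b] for b in s.encode('ascii'))

print(func('Py'))  # 2.0
```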

1 answer

foo.h

    // #include <unistd.h>  // for ssize_t
    double foo(char *str, ssize_t str_len, double weights[256]) {
        double output = 0.0;
        int i;
        for (i = 0; i < str_len; ++i) {
            output += weights[str[i]];
        }
        return output;
    }

    from cpython.string cimport PyString_GET_SIZE, PyString_Check, PyString_AS_STRING

    cdef extern from "foo.h":
        double foo(char *str, ssize_t str_len, double weights[256])

    cdef class Numbers:
        cdef double nums[256]

        def __cinit__(self, py_numbers):
            for i in range(256):
                self.nums[i] = py_numbers[i]

    def py_foo(my_str, Numbers nums_inst):
        cdef double res
        # check that my_str is a BYTE string
        if not PyString_Check(my_str):
            raise TypeError("bytestring expected, got %s instead" % type(my_str))
        res = foo(PyString_AS_STRING(my_str),
                  PyString_GET_SIZE(my_str),
                  nums_inst.nums)
        return res

(unverified)

