The fastest way to lowercase an array of numpy unicode strings in Cython

Numpy's string functions are very slow and perform worse than pure Python lists. I want to optimize the usual string functions using Cython.

For example, let's take an array of 100,000 unicode strings with a data type of either unicode or object, and lowercase each one.

    import numpy as np

    alist = ['JsDated', 'Ї'] * 50000
    arr_unicode = np.array(alist)
    arr_object = np.array(alist, dtype='object')

    %timeit np.char.lower(arr_unicode)
    51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using a list comprehension is just about as fast.

    %timeit [a.lower() for a in arr_unicode]
    44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For the object data type, we cannot use np.char . A list comprehension is 3 times faster.

    %timeit [a.lower() for a in arr_object]
    16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.

    import numpy as np
    cimport numpy as np
    from numpy cimport ndarray

    def lower(ndarray[object] arr):
        cdef int i
        cdef int n = len(arr)
        cdef ndarray[object] result = np.empty(n, dtype='object')
        for i in range(n):
            result[i] = arr[i].lower()
        return result

This gives a modest improvement.

    %timeit lower(arr_object)
    11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I tried accessing the memory directly through the ndarray's data attribute, as follows:

    def lower_fast(ndarray[object] arr):
        cdef int n = len(arr)
        cdef int i
        cdef char* data = arr.data
        cdef int itemsize = arr.itemsize
        for i in range(n):
            # no idea here

I believe data is one contiguous block of memory holding all the raw bytes one after another. Accessing those bytes is extremely fast, and it seems that transforming the raw bytes directly could improve performance by 2 orders of magnitude. I found the C++ tolower function, which might work, but I don't know how to hook it into Cython.
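For reference, wiring a C function like tolower from ctype.h into Cython only takes a cdef extern declaration. A minimal, ASCII-only sketch (the _lower_bytes helper is my own illustration, not code from the question):

    cdef extern from "ctype.h":
        int tolower(int c)

    cdef void _lower_bytes(char *data, Py_ssize_t nbytes):
        # tolower is only defined for values representable as an
        # unsigned char (or EOF), hence the cast before the call
        cdef Py_ssize_t i
        for i in range(nbytes):
            data[i] = <char>tolower(<unsigned char>data[i])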

Update with the fastest method found so far (does not work for unicode)

Here is the fastest method I have found so far, from another SO post. It lowercases all the ASCII characters by accessing the numpy memory directly through the data attribute. I believe it will mangle any unicode characters that happen to contain bytes between 65 and 90, but the speed is very good.

    cdef void f(char *a, int itemsize, int shape):
        cdef int i
        cdef int num
        for i in range(shape * itemsize):
            num = a[i]
            # shift ASCII uppercase (65-90) down to lowercase
            if 65 <= num <= 90:
                a[i] += 32

    def lower_fast(ndarray arr):
        cdef char *inp
        inp = arr.data
        f(inp, arr.itemsize, arr.shape[0])
        return arr

It is 100 times faster than the other methods and is exactly what I am looking for.

    %timeit lower_fast(arr_unicode)
    103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
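To make the expected corruption concrete, here is a hypothetical example (assuming the extension above is compiled, on a little-endian build where numpy stores unicode strings as four bytes per character): 'Ł' is U+0141, so the first byte of its buffer is 0x41 (65); the +32 shift turns it into U+0161 ('š') instead of the correct 'ł' (U+0142).

    import numpy as np

    arr = np.array(['Ł'])   # dtype '<U1': raw bytes 41 01 00 00
    lower_fast(arr)         # the Cython function above, applied in place
    print(arr[0])           # prints 'š', not 'ł'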
1 answer

This was a little faster than the list comprehension for me on my machine, and if you want to support unicode, it may be the fastest way to do it. You will need to apt-get install libunistring-dev , or whatever the equivalent is for your OS / package manager.

In some C file, say _lower.c , put:

    #include <stdlib.h>
    #include <string.h>
    #include <unistr.h>
    #include <unicase.h>

    void _c_tolower(uint8_t **s, uint32_t total_len) {
        size_t lower_len, s_len;
        uint8_t *s_ptr = *s, *lowered;
        while (s_ptr - *s < total_len) {
            s_len = u8_strlen(s_ptr);
            if (s_len == 0) {   /* skip the \x00 padding between strings */
                s_ptr += 1;
                continue;
            }
            lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
            memcpy(s_ptr, lowered, lower_len);
            free(lowered);
            s_ptr += s_len;
        }
    }

Then in lower.pxd you do

    cdef extern from "_lower.c":
        cdef void _c_tolower(unsigned char **s, unsigned int total_len)

Finally, in lower.pyx :

    from numpy cimport ndarray

    cpdef void lower(ndarray arr):
        cdef unsigned char *_arr
        _arr = <unsigned char *> arr.data
        _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
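The build step is not shown above; a minimal setup.py sketch that should work, assuming the filenames used here and a system-wide libunistring:

    from setuptools import Extension, setup
    from Cython.Build import cythonize
    import numpy as np

    setup(
        ext_modules=cythonize([
            Extension(
                "lower",
                sources=["lower.pyx"],
                libraries=["unistring"],        # provides u8_tolower
                include_dirs=[np.get_include()],
            )
        ])
    )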

On my laptop, I got 46 ms for the list comprehension you had above and 37 ms for this method (and 0.8 ms for your lower_fast ), so it's probably not worth it, but I figured I would write it up in case you needed an example of how to hook such a thing into Cython.

There are a few possible improvements, though I don't know whether they would make a big difference:

  • arr.data is, I think, something like a square matrix? (I don't know; I don't use numpy for anything.) It pads the ends of the shorter strings with \x00 s. I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just fast-forward manually (that's what the if (s_len == 0) clause is doing). I suspect that a single call to u8_tolower would be significantly faster than making the call thousands of times.
  • I do a lot of freeing and memcpying. You can probably avoid that if you're clever.
  • I think every lowercase character in unicode is at most as wide as its uppercase variant, so this shouldn't cause any segfaults, buffer overwrites, or overlapping-string problems, but don't take my word for it.

Not quite an answer, but I hope it helps your further investigations!

PS: You'll notice that this does the lowering in place, so usage looks like this:

    >>> alist = ['JsDated', 'Ї', '道德經', '  '] * 2
    >>> arr_unicode = np.array(alist)
    >>> lower_2(arr_unicode)
    >>> for x in arr_unicode:
    ...     print x
    ...
    jsdated
    ї
    道德經

    jsdated
    ї
    道德經

    >>> alist = ['JsDated', 'Ї'] * 50000
    >>> arr_unicode = np.array(alist)
    >>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
    0.046072959899902344
    >>> arr_unicode = np.array(alist)
    >>> ct = time(); lower_2(arr_unicode); time() - ct
    0.037489891052246094

EDIT

DUH, you can just modify the C function to look like this:

    void _c_tolower(uint8_t **s, uint32_t total_len) {
        size_t lower_len;
        uint8_t *lowered;
        lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
        memcpy(*s, lowered, lower_len);
        free(lowered);
    }

and then it does the whole thing in one go. This looks more dangerous in terms of possibly leaving some of the old data behind if lower_len ends up shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL PROBABLY BREAK.
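One possible guard against that stale-data risk (an untested sketch of mine, not part of the original answer) is to zero out whatever is left past the lowered text; memset comes from string.h , which is already included:

    void _c_tolower(uint8_t **s, uint32_t total_len) {
        size_t lower_len;
        uint8_t *lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
        memcpy(*s, lowered, lower_len);
        if (lower_len < total_len)
            /* clear any old bytes that survive past the lowered text */
            memset(*s + lower_len, 0, total_len - lower_len);
        free(lowered);
    }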

In any case, it is 40% faster:

    >>> alist = ['JsDated', 'Ї'] * 50000
    >>> arr_unicode = np.array(alist)
    >>> ct = time(); lower_2(arr_unicode); time() - ct
    0.022463043975830078
