It was a little faster than the list comprehension for me on my machine, and if you want to support Unicode, this might be the fastest way to do it. You will need apt-get install libunistring-dev or whatever suits your OS / package manager.
In some C file, say _lower.c:

#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>

void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while (s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {
            s_ptr += 1;
            continue;
        }
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}
Then in lower.pxd you do
cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)
Finally, in lower.pyx:

from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char * _arr
    _arr = <unsigned char *> arr.data
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
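For completeness, here is a minimal setup.py sketch for building this; the module name lower and the library name unistring are assumptions, and _lower.c gets pulled in through the cdef extern in lower.pxd rather than being compiled separately:

```python
# setup.py -- hypothetical build script for the lower extension
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "lower",
    sources=["lower.pyx"],            # _lower.c is included via the pxd's cdef extern
    include_dirs=[np.get_include()],  # for the numpy cimport
    libraries=["unistring"],          # link libunistring for u8_tolower / u8_strlen
)

setup(ext_modules=cythonize(ext))
```

Then something like python setup.py build_ext --inplace should do it, assuming Cython and numpy are installed.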
On my laptop, I got 46 ms for the list comprehension you had above, and 37 ms for this method (and 0.8 ms for your lower_fast), so it's probably not worth it, but I figured I'd post it in case you need an example of how to hook something like this up with Cython.
There are a few possible points of improvement, though I don't know how much difference they'd make:
- arr.data is, I think, something like a square matrix (I don't know, I don't really use numpy), with the ends of the shorter strings padded out with \x00s. I was too lazy to figure out how to get u8_tolower to stop at the 0s, so I just manually fast-forward past them (which is what the if (s_len == 0) clause is for). I suspect a single u8_tolower call would be significantly faster than calling it thousands of times.
- I do a lot of freeing / memcpying. You can probably avoid that if you're clever.
- I think every lowercase character in Unicode is at most as wide as its uppercase variant, so this shouldn't cause any segfaults or overwritten buffers or overlapping-string problems, but don't take my word for it.
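The \x00 padding mentioned in the first point is easy to see from Python: numpy stores string arrays as fixed-width slots sized to the longest element (using bytestrings here so the padding bytes are visible):

```python
import numpy as np

# Fixed-width bytestring array: every slot is itemsize bytes wide
arr = np.array([b'JsDated', b'abc'])
print(arr.dtype)      # S7: slots are 7 bytes, the width of the longest string
print(arr.itemsize)   # 7
# Shorter strings are padded out to the slot width with \x00s
print(arr.tobytes())  # b'JsDatedabc\x00\x00\x00\x00'
```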
Not quite an answer, but hope this helps in your further investigations!
PS You'll notice that this does the lowering in place, so usage would be like this:
>>> alist = ['JsDated', 'Ї', '道德經', ' '] * 2
>>> arr_unicode = np.array(alist)
>>> lower_2(arr_unicode)
>>> for x in arr_unicode:
...     print x
...
jsdated
ї
道德經
 
jsdated
ї
道德經
 
>>> alist = ['JsDated', 'Ї'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.037489891052246094
EDIT
DUH, if you modify the C function to look like this:
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len;
    uint8_t *lowered;
    lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
    memcpy(*s, lowered, lower_len);
    free(lowered);
}
then it does it all in one go. This looks more dangerous in that, if lower_len comes out shorter than the original data, some of the old bytes may be left hanging around past the end... In short, THIS CODE IS TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL BREAK.
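That stale-data hazard is easy to demonstrate in pure Python: copying a shorter lowered result over the front of a buffer leaves the tail of the old contents in place (the buffer contents here are made up for illustration):

```python
# Simulate memcpy(*s, lowered, lower_len) when lowering shrinks the data
buf = bytearray(b'HELLOWORLD')
lowered = b'hello'            # pretend lowercasing produced fewer bytes
buf[:len(lowered)] = lowered  # only the first lower_len bytes are replaced
print(bytes(buf))             # b'helloWORLD' -- the old tail survives
```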
It's 40% faster, anyway:
>>> alist = ['JsDated', 'Ї'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.022463043975830078