It does not work the way you think. bytearray creates a copy of the string. The interpreter then unpacks the bytearray sequence into a starargs tuple and merges it into yet another new tuple along with any other arguments (although in this case there are none). Finally, the c_ubyte array's initializer loops over the args tuple to set each element of the array. That is a lot of work, and a lot of copies, just to initialize the array.
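For concreteness, the slow path described above looks like this (a minimal sketch; a Python 3 bytes literal stands in for the Python 2 byte string used in this answer):

```python
import ctypes

s = b'01234567890123456789'   # in Python 2 this would be a plain str
arr_t = ctypes.c_ubyte * 20

# bytearray(s) copies the string; *-unpacking builds a tuple of 20 ints;
# the array initializer then loops over that tuple element by element
arr = arr_t(*bytearray(s))
assert bytes(arr) == s
```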
Instead, you can use the from_buffer_copy method, assuming the string is a byte string with a buffer interface (not unicode):
import ctypes

str_bytes = '01234567890123456789'
raw_bytes = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)
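A couple of properties of from_buffer_copy are worth knowing (a sketch in Python 3, where the source must be a bytes-like object):

```python
import ctypes

data = b'01234567890123456789'
arr = (ctypes.c_ubyte * 20).from_buffer_copy(data)
assert bytes(arr) == data   # the data is copied in a single pass

# the source buffer must be at least as large as the array,
# otherwise ctypes raises ValueError
try:
    (ctypes.c_ubyte * 21).from_buffer_copy(data)
except ValueError:
    pass

# it really is a copy: mutating the array does not touch the source
arr[0] = 97
assert data[0] == 48
```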
It still has to copy the string, but the copy happens only once and is much more efficient. As pointed out in the comments, a Python string is immutable and may be interned or used as a dict key. Its immutability should be respected, even though ctypes lets you violate it in practice:
>>> from ctypes import *
>>> s = '01234567890123456789'
>>> b = cast(s, POINTER(c_ubyte * 20))[0]
>>> b[0] = 97
>>> s
'a1234567890123456789'
Edit
I need to emphasize that I do not recommend using ctypes to mutate CPython's immutable strings. If you must, then at least check sys.getrefcount beforehand to make sure the reference count is 2 or less (the call itself adds 1). Otherwise, you will eventually be surprised by the interning of strings used for names (for example, "sys") and for code object constants. Python is free to reuse immutable objects as it sees fit. If you step outside the language to mutate an "immutable" object, you have broken the contract.
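The refcount check mentioned above can be sketched like this (Python 3 spelling; the exact counts are a CPython implementation detail):

```python
import sys

# a string built at run time is not interned automatically
s = ''.join(['0123456789', '0123456789'])

# sys.getrefcount counts its own argument as one temporary reference,
# so a result of 2 means the name s is the only holder of this object
print(sys.getrefcount(s))
```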
For example, if you modify a string that has already been hashed, the cached hash no longer matches the contents. That breaks its use as a dict key. Neither a new string with the new content nor one with the original content will match the key in the dict: the former has a different hash, and the latter has a different value. The only way to reach the dict item is then to use the mutated string with its stale hash. Continuing the previous example:
>>> s
'a1234567890123456789'
>>> d = {s: 1}
>>> d[s]
1
>>> d['a1234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1234567890123456789'
>>> d['01234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '01234567890123456789'
Now consider the mess if the key is an interned string that is reused in dozens of places.
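To see interning in action (a small sketch; sys.intern is the Python 3 spelling of what was the intern builtin in Python 2):

```python
import sys

a = sys.intern('some_dict_key')
b = sys.intern('some_dict_key')

# both names refer to the single interned object, so mutating it
# through ctypes would corrupt every place that uses this string
assert a is b
```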
It is typical to use the timeit module for performance analysis. Prior to 3.3, timeit.default_timer is platform dependent: on POSIX systems it is time.time, while on Windows it is time.clock. In 3.3 and later it is time.perf_counter on all platforms.
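As a quick illustration of using default_timer directly (a sketch with an arbitrary workload):

```python
import timeit

start = timeit.default_timer()
total = sum(range(100000))          # arbitrary work to time
elapsed = timeit.default_timer() - start
print('%.6f seconds' % elapsed)
```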
import timeit

setup = r'''
import ctypes, numpy
str_bytes = '01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
    'arr_t(*bytearray(str_bytes))',
    'arr_t.from_buffer_copy(str_bytes)',
    'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
    'numpy.asarray(str_bytes).ctypes.data_as('
        'ctypes.POINTER(arr_t))[0]',
]

test = lambda m: min(timeit.repeat(m, setup))
>>> tabs = [test(m) for m in methods]
>>> trel = [t / tabs[0] for t in tabs]
>>> trel
[1.0, 0.060573711879182784, 0.261847116395079, 1.5389279092185282]