Python C Unicode Arguments

I have a simple Python script:

    import _tph
    str = u'Привет, <b>мир!</b>'  # Some Unicode string with Russian characters
    _tph.strip_tags(str)

and a C library that is compiled into _tph.so. This is the strip_tags function from it:

    PyObject *strip_tags(PyObject *self, PyObject *args)
    {
        PyUnicodeObject *string;
        Py_ssize_t length;

        PyArg_ParseTuple(args, "u#", &string, &length);
        printf("%d, %d\n", string->length, length);
        // ...
    }
The printf call prints this: 1080, 19. So the length of str really is 19 characters, but where on earth do these 1080 characters come from?

When I print string, I get my str, a null character, and then a lot of junk.

The junk memory looks like this:

u'\u041f\u0440\u0438\u0432\u0435\u0442, <b>\u043c\u0438\u0440!</b>\x00\x00\u0299\Ub7024000\U08c55800\Ub7025904\x00\Ub777351c\U08c79e58\x00\U08c7a0b4\x00\Ub7025904\Ub7025954\Ub702594c\Ub702595c\U00702594c\U0070259492\U0070259292\U0070259292\U0070259292\U0070259492\U0070259492\U0070259492\U0070259492\U0070259492\U0070259492\

How can I get a normal string here?

1 answer

The string argument is not a C string here. It is declared as a pointer to a Python Unicode object, so your printf sees a lot of binary data (object type, GC headers, reference count, and the encoded Unicode code points) until it reaches a null byte, which printf interprets as the end of the string.
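As a rough illustration of where the 1080 could come from, the Python sketch below models the memory rather than calling the C API. It rests on two assumptions that the question does not confirm: a wide (4-byte Py_UNICODE) build, and a 32-bit object header made of a reference count and a type pointer followed by the length field. The C API documentation describes "u#" as storing a pointer to the Unicode data buffer plus its length, so reading string->length through a PyUnicodeObject * would interpret the third character of the buffer as the length. That reading suggests declaring string as Py_UNICODE * and relying on length for the size. The names raw, fake_refcnt, fake_type and fake_length are made up for this sketch.

```python
import struct

# The string from the question, written with escapes so the file stays ASCII.
text = u'\u041f\u0440\u0438\u0432\u0435\u0442, <b>\u043c\u0438\u0440!</b>'

# Assumption: wide build, one 4-byte little-endian unit per character.
# This stands in for the buffer that "u#" points at.
raw = text.encode('utf-32-le')

# Assumption: 32-bit header layout of refcount (4 bytes), type pointer
# (4 bytes), then the length field. Reading the character buffer through
# that layout takes the first three code units as those three fields.
fake_refcnt, fake_type, fake_length = struct.unpack_from('<III', raw, 0)

print(len(text))                    # 19, the length that "u#" reports
print(fake_length)                  # 1080, what string->length would read
print(fake_length == ord(text[2]))  # True: 1080 is U+0438, the third character
```

Under these assumptions the two numbers match the output in the question, which hints that the 19 is trustworthy and the 1080 is character data misread as a struct field.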

The easiest way to view a string is PyObject_Print(string) . You can find C functions for managing Python Unicode objects at: http://docs.python.org/c-api/unicode.html#unicode-objects
