Why is startswith slower than in?

Question

Surprisingly, I find that startswith is slower than in:

    In [10]: s="ABCD"*10

    In [11]: %timeit s.startswith("XYZ")
    1000000 loops, best of 3: 307 ns per loop

    In [12]: %timeit "XYZ" in s
    10000000 loops, best of 3: 81.7 ns per loop

As we all know, the in operation has to search the whole string, while startswith only needs to check the first few characters, so startswith should be more efficient.

When s is large enough, startswith is faster:

    In [13]: s="ABCD"*200

    In [14]: %timeit s.startswith("XYZ")
    1000000 loops, best of 3: 306 ns per loop

    In [15]: %timeit "XYZ" in s
    1000000 loops, best of 3: 666 ns per loop

So it seems that calling startswith has some fixed overhead, which makes it slower when the string is small.
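The two measurements above can be reproduced outside IPython with the standard timeit module. This is only a sketch in Python 3: the helper name time_ns and the loop counts are assumptions made for illustration, and the absolute numbers will differ by machine and interpreter version.

```python
import timeit


def time_ns(stmt, s, number=200_000, repeat=5):
    """Best per-call time of stmt in nanoseconds (hypothetical helper)."""
    # The statement sees the string under the name "s", as in the question.
    timer = timeit.Timer(stmt, globals={"s": s})
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    for count in (10, 200):  # the two string sizes used in the question
        s = "ABCD" * count
        t_starts = time_ns('s.startswith("XYZ")', s)
        t_in = time_ns('"XYZ" in s', s)
        print(f"len={len(s):4d}  startswith: {t_starts:6.1f} ns  in: {t_in:6.1f} ns")
```

If the fixed-overhead reading is right, the startswith column should stay roughly constant while the in column grows with the length of s.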

Then I tried to figure out what the overhead of the startswith call is.

First, I bound the method to a variable f to avoid the cost of the dot operation. Here we can see that startswith is still slower:

    In [16]: f=s.startswith

    In [17]: %timeit f("XYZ")
    1000000 loops, best of 3: 270 ns per loop

In addition, I checked the cost of an empty function call:

    In [18]: def func(a): pass

    In [19]: %timeit func("XYZ")
    10000000 loops, best of 3: 106 ns per loop

Excluding the cost of the dot operation and the function call, startswith takes about (270 - 106) = 164 ns, while the in operation takes only 81.7 ns. There seems to be some additional overhead in startswith. What is it?
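The subtraction above can be scripted so that all four numbers come from the same run. This is a rough Python 3 sketch: best_ns, measure, and the dictionary keys are names invented here, and subtracting an empty Python function call is only an approximation of the call overhead of a built-in method.

```python
import timeit


def best_ns(stmt, namespace, number=200_000, repeat=5):
    """Best per-call time of stmt in nanoseconds (hypothetical helper)."""
    timer = timeit.Timer(stmt, globals=namespace)
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e9


def func(a):
    """Empty function, used to estimate bare call overhead."""
    pass


def measure(s, number=200_000, repeat=5):
    """Time the four statements from the question on the string s."""
    ns = {"s": s, "f": s.startswith, "func": func}
    results = {
        "method call": best_ns('s.startswith("XYZ")', ns, number, repeat),
        "bound method": best_ns('f("XYZ")', ns, number, repeat),
        "empty call": best_ns('func("XYZ")', ns, number, repeat),
        "in operator": best_ns('"XYZ" in s', ns, number, repeat),
    }
    # Rough estimate of startswith's own work, as computed in the question.
    results["bound minus empty call"] = (
        results["bound method"] - results["empty call"]
    )
    return results


if __name__ == "__main__":
    for name, value in measure("ABCD" * 10).items():
        print(f"{name:24s} {value:7.1f} ns")
```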

Edit: I added a timing comparison between startswith and __contains__, as suggested by poke and lvc:

    In [28]: %timeit s.startswith("XYZ")
    1000000 loops, best of 3: 314 ns per loop

    In [29]: %timeit s.__contains__("XYZ")
    1000000 loops, best of 3: 192 ns per loop

+54

python python-2.7 cpython startswith python-internals

WKPlus Aug 10 '15 at 10:35


2 answers

Probably because str.startswith() does more than str.__contains__(), and also because I believe str.__contains__ operates completely in C, whereas str.startswith() has to interact with Python types. Its signature is str.startswith(prefix[, start[, end]]), where prefix can be a tuple of strings to try.
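To illustrate the extra flexibility described above, here is a small Python 3 sketch of the call forms that startswith accepts and that the in operator on strings does not. The variable names are chosen here for illustration only.

```python
s = "ABCD" * 10

plain_prefix = s.startswith("ABCD")          # single prefix
any_of_tuple = s.startswith(("XYZ", "AB"))   # tuple: true if any entry matches
with_start = s.startswith("CD", 2)           # comparison begins at index 2
with_end = s.startswith("ABCD", 0, 3)        # only s[0:3] == "ABC" is considered

# The in operator on a str only accepts a str on the left-hand side.
try:
    ("XYZ", "AB") in s
    tuple_in_allowed = True
except TypeError:
    tuple_in_allowed = False

print(plain_prefix, any_of_tuple, with_start, with_end, tuple_in_allowed)
# prints: True True True False False
```

Each of these forms has to be recognized when the method is called, which is consistent with the idea that startswith does more per call than a plain substring test.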

+1

Cyphase Aug 10 '15 at 10:49


poke · Accepted Answer · 2015-08-10 10:52

As mentioned in the comments, if you use s.__contains__("XYZ"), you get a result that is closer to s.startswith("XYZ"), because it has to take the same route: a member lookup on the string object, followed by a function call. This is usually somewhat expensive (not enough that you should worry about it, of course). On the other hand, when you write "XYZ" in s, the parser interprets the operator and can short-cut the member access to __contains__ (or rather to the implementation behind it, because __contains__ itself is just one way to access that implementation).

You can get an idea of this by looking at the bytecode:

    >>> import dis
    >>> dis.dis('"XYZ" in s')
      1           0 LOAD_CONST               0 ('XYZ')
                  3 LOAD_NAME                0 (s)
                  6 COMPARE_OP               6 (in)
                  9 RETURN_VALUE
    >>> dis.dis('s.__contains__("XYZ")')
      1           0 LOAD_NAME                0 (s)
                  3 LOAD_ATTR                1 (__contains__)
                  6 LOAD_CONST               0 ('XYZ')
                  9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
                 12 RETURN_VALUE

So comparing s.__contains__("XYZ") with s.startswith("XYZ") will produce a more similar result; however, for your example string s, startswith will still be slower.
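The same bytecode comparison can be done programmatically. Opcode names differ between CPython versions (newer releases may show CONTAINS_OP instead of COMPARE_OP, and LOAD_METHOD or CALL instead of LOAD_ATTR and CALL_FUNCTION), so this Python 3 sketch only checks for the presence of an attribute lookup and a call. The helper name opnames is an assumption made here.

```python
import dis


def opnames(expr):
    """Return the opcode names of a compiled expression (hypothetical helper)."""
    code = compile(expr, "<expr>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


in_ops = opnames('"XYZ" in s')
call_ops = opnames('s.__contains__("XYZ")')

# Attribute lookup appears as LOAD_ATTR or LOAD_METHOD depending on version.
lookup_ops = {"LOAD_ATTR", "LOAD_METHOD"}

in_has_lookup = any(name in lookup_ops for name in in_ops)
call_has_lookup = any(name in lookup_ops for name in call_ops)
in_has_call = any(name.startswith("CALL") for name in in_ops)
call_has_call = any(name.startswith("CALL") for name in call_ops)

print("in operator: ", in_ops)
print("method call: ", call_ops)
```

On the interpreters this was written against, the in expression compiles without any attribute lookup or call instruction, while the explicit method call needs both.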

To dig further, you can check the implementation of both. What is interesting about the contains implementation is that it is statically typed and simply assumes that the argument is a unicode object. That makes it quite efficient.

The startswith implementation, however, is a "dynamic" Python method, which requires the implementation to actually parse its arguments. startswith also supports a tuple as an argument, which makes the start-up of the whole method a bit slower (shortened by me, with my comments):

    static PyObject *
    unicode_startswith(PyObject *self, PyObject *args)
    {
        // argument parsing
        PyObject *subobj;
        PyObject *substring;
        Py_ssize_t start = 0;
        Py_ssize_t end = PY_SSIZE_T_MAX;
        int result;
        if (!stringlib_parse_args_finds("startswith", args, &subobj, &start, &end))
            return NULL;

        // tuple handling
        if (PyTuple_Check(subobj)) {}

        // unicode conversion
        substring = PyUnicode_FromObject(subobj);
        if (substring == NULL) {}

        // actual implementation
        result = tailmatch(self, substring, start, end, -1);
        Py_DECREF(substring);
        if (result == -1)
            return NULL;
        return PyBool_FromLong(result);
    }

This is likely a big reason why startswith is slower on strings for which contains is fast because of its simplicity.
