Speed comparison: startswith() vs. in

I was under the impression that startswith should be faster than in, for the simple reason that in has to do more checks (it allows the word being looked for to be anywhere in the string). But I had my doubts, so I decided to timeit. The code for the timings is given below and, as you will probably notice, I have not done much timing before; the code is rather simple.

    import timeit

    setup1='''
    def in_test(sent, word):
        if word in sent:
            return True
        else:
            return False
    '''

    setup2='''
    def startswith_test(sent, word):
        if sent.startswith(word):
            return True
        else:
            return False
    '''

    print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
    print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))

Results:

    >> in: 0.11912814951705597
    >> startswith: 0.22812353561129417

So, startswith is about twice as slow! I find this behavior very perplexing, considering what I said above. Am I doing something wrong when timing the two, or is in really faster? If so, why?

Note that the results are very similar even when both return False (in this case, in has to actually traverse the whole sentence, in case it was simply short-circuiting before):

    print(timeit.timeit('in_test("another standard sentence, would be that", "this")', setup=setup1))
    print(timeit.timeit('startswith_test("another standard sentence, would be that", "this")', setup=setup2))

    >> in: 0.12854891578786237
    >> startswith: 0.2233201940338861

If I had to implement the two functions from scratch, they would look something like this (pseudocode):

startswith: start comparing the letters of the word with the letters of the sentence one by one until a) the word is exhausted (return True) or b) a comparison fails (return False).

in: call startswith at each position in the sentence where the initial letter of the word can be found.
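
The pseudocode above can be sketched in Python roughly as follows. This is only an illustration of the mental model in the question: the names naive_startswith and naive_in are hypothetical, and CPython's real implementations are written in C and use a different search algorithm.

```python
def naive_startswith(sent: str, word: str) -> bool:
    """Compare word against the beginning of sent, one character at a time."""
    if len(word) > len(sent):
        return False
    for i, ch in enumerate(word):
        if sent[i] != ch:
            return False  # b) a comparison failed
    return True  # a) the word was exhausted


def naive_in(sent: str, word: str) -> bool:
    """Try a prefix match at every position where word's first letter occurs."""
    if word == "":
        return True  # mirrors the built-in: "" in any_string is True
    for start, ch in enumerate(sent):
        if ch == word[0] and naive_startswith(sent[start:], word):
            return True
    return False


print(naive_startswith("this is a standard sentence", "this"))
print(naive_in("this is a standard sentence", "standard"))
```

Under this model, in does at least as much comparison work as startswith, which is why the measured timings look surprising.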

I just do not understand.


Just to make it clear, in and startswith are not equivalent; I am only talking about the case where the word being searched for should be the first thing in the string.

3 answers

This is because of the attribute lookup and the method call. in is specialized and compiles directly to COMPARE_OP (which calls cmp_outcome, which in turn calls PySequence_Contains), whereas str.startswith goes through slower bytecode:

    2 LOAD_ATTR                0 (startswith)
    4 LOAD_FAST                1 (word)
    6 CALL_FUNCTION            1  # the slow part
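
One way to check this on your own interpreter is to list the opcodes with the dis module, as in the sketch below. The helper opnames is a hypothetical name introduced here, and the exact opcode names vary between CPython versions, so the sketch only looks for the substring "CALL" rather than one specific opcode.

```python
import dis


def in_test(sent, word):
    if word in sent:
        return True
    else:
        return False


def startswith_test(sent, word):
    if sent.startswith(word):
        return True
    else:
        return False


def opnames(func):
    """Return the names of the opcodes CPython compiled func into."""
    return [instr.opname for instr in dis.Bytecode(func)]


in_ops = opnames(in_test)
sw_ops = opnames(startswith_test)
print("in:        ", in_ops)
print("startswith:", sw_ops)

# Opcode names differ between versions, so match on the substring only.
print("in uses a call opcode:        ", any("CALL" in op for op in in_ops))
print("startswith uses a call opcode:", any("CALL" in op for op in sw_ops))
```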

Replacing in with __contains__, which forces a method call for that case as well, largely negates the difference in speed:

    setup1='''
    def in_test(sent, word):
        if sent.__contains__(word):
            return True
        else:
            return False
    '''

And the timings:

    print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
    print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))
    0.43849368393421173
    0.4993997460696846

in wins here because it does not need to go through the whole function-call machinery, and because of the favorable case it is presented with.


You are comparing an operator on strings with an attribute lookup plus a function call. The second will have more overhead, even if the first takes a long time on a lot of data.

Also, you are looking for the first word, so if it matches, in will look at just as much data as startswith(). To see the difference, you should look at a pessimistic case (no match found, or a match only at the end of the string):

    setup1='''
    data = "xxxx"*1000
    def ....

    print(timeit.timeit('in_test(data, "this")', setup=setup1))
    0.932795189000899

    print(timeit.timeit('startswith_test(data, "this")', setup=setup2))
    0.22242475600069156
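
For reference, a self-contained version of this pessimistic-case measurement might look like the sketch below. The function bodies are an assumption (the snippet above elides them), number=100_000 is chosen only to keep the run short, and the absolute timings will vary by machine and Python version.

```python
import timeit

# Same data as above: 4000 characters with no match anywhere.
data = "xxxx" * 1000


def in_test(sent, word):
    # Assumed body; the original snippet elides the function definitions.
    return word in sent


def startswith_test(sent, word):
    # Assumed body; the original snippet elides the function definitions.
    return sent.startswith(word)


# Pass the names explicitly so timeit can see them without a setup string.
namespace = {"in_test": in_test, "startswith_test": startswith_test, "data": data}

t_in = timeit.timeit('in_test(data, "this")', globals=namespace, number=100_000)
t_sw = timeit.timeit('startswith_test(data, "this")', globals=namespace, number=100_000)

print(f"in:         {t_in:.4f} s")
print(f"startswith: {t_sw:.4f} s")
```

Here in has to scan the whole string before giving up, while startswith only has to inspect the first few characters.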

If you look at the bytecode generated for your functions:

    >>> dis.dis(in_test)
      2           0 LOAD_FAST                1 (word)
                  3 LOAD_FAST                0 (sent)
                  6 COMPARE_OP               6 (in)
                  9 POP_JUMP_IF_FALSE       16

      3          12 LOAD_CONST               1 (True)
                 15 RETURN_VALUE

      5     >>   16 LOAD_CONST               2 (False)
                 19 RETURN_VALUE
                 20 LOAD_CONST               0 (None)
                 23 RETURN_VALUE

You will notice that there is a lot of overhead not directly related to string matching. Running the test on a simpler function:

    def in_test(sent, word):
        return word in sent

will be more reliable.
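
As a rough check of that claim, the sketch below counts the bytecode instructions generated for the two forms. The names in_test_verbose, in_test_simple and instruction_count are hypothetical and used only for this comparison; the actual counts depend on the CPython version, so none are stated here.

```python
import dis


def in_test_verbose(sent, word):
    # The if/else form used in the question.
    if word in sent:
        return True
    else:
        return False


def in_test_simple(sent, word):
    # The single-expression form suggested above.
    return word in sent


def instruction_count(func):
    """Count the bytecode instructions CPython generated for func."""
    return sum(1 for _ in dis.Bytecode(func))


print("verbose:", instruction_count(in_test_verbose))
print("simple: ", instruction_count(in_test_simple))
```

Both functions return the same results; the simpler one just leaves less unrelated bytecode in the measurement.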
