Question
Why, in CPython,
def add_string(n): s = '' for _ in range(n): s += ' '
take linear time but
def add_string_in_list(n): l = [''] for _ in range(n): l[0] += ' '
takes quadratic time?
Evidence:
Timer(partial(add_string, 1000000)).timeit(1) #>>> 0.1848409200028982 Timer(partial(add_string, 10000000)).timeit(1) #>>> 1.1123797750042286
Timer(partial(add_string_in_list, 10000)).timeit(1) #>>> 0.0033865350123960525 Timer(partial(add_string_in_list, 100000)).timeit(1) #>>> 0.25131178900483064
What i know
CPython has optimizations for adding a line when the line being added has a reference count of 1.
This is because strings in Python are immutable, and therefore they usually cannot be edited. If there are several links for a line and it is mutated, both links will see the changed line. This is clearly not required, so a mutation cannot happen with multiple links.
If there is only one link to a string, however, changing the value will only change the string for that link, which requires changing it. You can check that this is a probable reason:
from timeit import Timer from functools import partial def add_string_two_references(n): s = '' for _ in range(n): s2 = s s += ' ' Timer(partial(add_string_two_references, 20000)).timeit(1)
I am not sure why the factor is only 30 times instead of the expected 100x, but I believe that this is an overhead.
What I do not know
So why does the list version create two links? Is it even that which hinders optimization?
You can verify that this does not apply to normal objects in different ways:
class Counter: def __iadd__(self, other): print(sys.getrefcount(self)) s = Counter() s += None
optimization python string cpython
Veedrac Jun 04 '14 at 14:28 2014-06-04 14:28
source share