Python Idiom string pairing. Clarification required.

From http://jaynes.colorado.edu/PythonIdioms.html

"Build strings as a list and use" In the end. join is a string method called on the delimiter, not a list. Calling it from an empty string combines fragments without a separator, which is a quirk of Python and rather surprisingly. This is important: string construction with + is quadratic time instead of linear! If you study one idiom, study it.

Invalid: for s in rows: result + = s

Right: Result = '' .join (strings) "

I am not sure why this is so. If I have several lines, I want to join them, for me it is not interesting for me to put them in a list, and then call ".join". Doesn't some overhead put them on the list? Clarify...

Python command line:

>>> str1 = 'Not' >>> str2 = 'Cool' >>> str3 = ''.join([str1, ' ', str2]) #The more efficient way **A** >>> print str3 Not Cool >>> str3 = str1 + ' ' + str2 #The bad way **B** >>> print str3 Not Cool 

Is A really linear time and B is quadratic time?

+4
source share
5 answers

Yes. For the examples you chose, the importance is unclear because you only have two very short lines, so the application is likely to be faster.

But every time you do a + b with strings in Python, it calls a new selection and then copies all the bytes from a and b to a new line. If you do this in a loop with a lot of lines, these bytes must be copied again and again, and again, and each time the amount to be copied increases. This gives a quadratic behavior.

On the other hand, creating a list of strings does not copy the contents of the strings β€” it simply copies the links. It is incredibly fast and runs in linear time. Then the join method makes only one memory allocation and copies each line to the correct position only once. It also takes only linear time.

So yes, use idiom ''.join if you are potentially dealing with a lot of strings. For two lines this does not matter.

If you need to convince more, try creating a string of 10M characters yourself:

 >>> chars = ['a'] * 10000000 >>> r = '' >>> for c in chars: r += c >>> print len(r) 

Compared with:

 >>> chars = ['a'] * 10000000 >>> r = ''.join(chars) >>> print len(r) 

The first method takes about 10 seconds. The second takes less than 1 second.

+13
source

Repeated concatenation is quadratic because it is Schlemiel the Painter Algorithm (be careful that some implementations optimize this, so it is not quadratic). join avoids this because it takes the entire list of strings, allocates the necessary space, and concatenates in one pass.

+6
source

When you have s1 + s2 code, Python should highlight a new string object, copy all s1 characters into it, followed by all s2 characters. This trivial operation does not bear quadratic time costs: the cost is O(len(s1) + len(s2)) (plus a constant for distribution, but this does not appear in big-O; -).

However, consider the code in the quote you give: for s in strings: result += s .

Here, every time a new s added, all the previous ones must first be copied to the newly allocated space for result (the lines are immutable, so a new distribution and copy should take place). Suppose you have N lines of length L: you copy L characters the first time, then 2 * L a second time, then 3 * L a third time ... in everything that makes it L * N * (N+1) / 2 character that is copied ... therefore, yep, it is quadratic in N.

In some other cases, a quadratic algorithm may be faster than a linear one for small values ​​of N (since factors and fixed fixed costs can be much less); but this is not so because allocations are expensive (both directly and indirectly due to the likelihood of memory fragmentation). For comparison, the overhead for storing rows in the list is negligible.

+4
source

Joel writes about this in Back to Basics .

+1
source

This is not obvious if you mean the same thing as other people. This optimization is important when you have many lines, for example M of length N. Then

A

 x = ''.join(strings) # Takes M*N operations 

IN

 x = '' for s in strings: x = x + s # Takes N + 2*N + ... + M*N operations 

If the implementation is optimized, yes, A is linear in the total length T = M*N , and B is T*T / N , which is always worse and approximately quadratic if M >> N

Now why is this really intuitive for join : when you say β€œI have some rows”, you can usually formalize this by saying that you have an iterator that returns the rows. Now this is exactly what you are going to "string".join()

0
source

All Articles