Immutable Python strings and slices

Question

Strings in Python are immutable and support the buffer interface. It could therefore be efficient for slices and the split() method to return not new strings, but buffers pointing into the old string. As far as I know, though, a new string object is built every time. Why is that? The only reason I can see is that it would make garbage collection a little harder.

True, in ordinary situations the memory overhead is linear and not noticeable: copying is fast, and so is allocation. But so much has already been done in Python that it is hard to argue such buffers are not worth the effort!

EDIT:

It seems that forming substrings this way would make memory management much more complicated. A simple example is the case where only 1/5 of an arbitrary string is in use and the rest of the string cannot be freed. The memory allocator could be improved so that it can free strings partially, but this would most likely be a net loss. If memory usage becomes critical, all the standard functions can be emulated with buffer or memoryview anyway. The code will not be as concise, but you have to give something up to gain something.

+7

python garbage-collection string

gukoff Aug 4 '13 at 10:42

3 answers

That is how slices work. Slices always make a shallow copy, which lets you do things like

    >>> x = [1,2,3]
    >>> y = x[:]
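To illustrate what "shallow copy" means for a slice, here is a small sketch (the variable names are made up for the illustration): the slice builds a new outer list, but the elements themselves are shared with the original.

```python
# A slice of a list creates a new list object whose items are the same objects.
x = [[1, 2], [3, 4]]
y = x[:]                         # shallow copy of the outer list

outer_is_new = y is not x        # the slice is a distinct list object
inner_is_shared = y[0] is x[0]   # the inner lists are not copied

y.append([5, 6])                 # changes only the copy
x[0].append(99)                  # visible through both, the inner list is shared
```

For strings the distinction does not matter in practice, because characters are immutable; what the question asks about is whether the character data itself has to be copied.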

Now one could make an exception for strings, but is it really worth it? Eric Lippert wrote about his decision not to do this for .NET; I think his argument holds for Python as well.

See also this question.

+3

Tim Pietzcker Aug 4 '13 at 10:45

If you are worried about memory (in the case of really large strings), use buffer():

    >>> a = "12345"
    >>> b = buffer(a, 2, 2)
    >>> b
    <read-only buffer for 0xb734d120, size 2, offset 2 at 0xb734d4a0>
    >>> print b
    34
    >>> print b[:]
    34

Knowing this opens up alternatives to string methods such as split().
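Note that buffer() exists only in Python 2. In Python 3 the closest counterpart is memoryview, which works on bytes-like objects (bytes, bytearray) but not on str. A minimal sketch under that assumption, with illustrative variable names:

```python
# Python 3 sketch: memoryview gives zero-copy slices of bytes-like objects.
data = bytearray(b"12345")
view = memoryview(data)[2:4]   # no bytes are copied here

first = bytes(view)            # bytes() makes an explicit copy of the two bytes
data[2] = ord("X")             # mutate the underlying buffer in place
second = bytes(view)           # the view sees the change, so it shares memory
```

A bytearray is used here only so that the sharing can be observed through a mutation; with immutable bytes the view works the same way.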

If you want to split() a string but keep the original string object (as you may still need it), you can do:

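    def split_buf(s, needle):
        # Python 2 only: buffer() was removed in Python 3
        start = 0                      # buffer() needs an integer offset, not None
        add = len(needle)
        res = []
        while True:
            index = s.find(needle, start)
            if index < 0:
                break
            res.append(buffer(s, start, index - start))
            start = index + add
        res.append(buffer(s, start))   # remainder after the last separator
        return res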

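or using .index():

    def split_buf(s, needle):
        # Same idea with .index(), which raises ValueError when nothing is found
        start = 0                      # buffer() needs an integer offset, not None
        add = len(needle)
        res = []
        try:
            while True:
                index = s.index(needle, start)
                res.append(buffer(s, start, index - start))
                start = index + add
        except ValueError:
            pass
        res.append(buffer(s, start))   # remainder after the last separator
        return res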
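For readers on Python 3, the same idea can be sketched with memoryview over bytes. This is an adaptation, not the answer's own code: the function name split_view is hypothetical, and it only works on bytes-like input, since str does not expose the buffer protocol in Python 3.

```python
def split_view(data, needle):
    """Split bytes-like `data` on `needle`, returning memoryview slices (no copies)."""
    if not needle:
        raise ValueError("empty separator")
    view = memoryview(data)
    parts = []
    start = 0
    while True:
        index = data.find(needle, start)
        if index < 0:
            break
        parts.append(view[start:index])
        start = index + len(needle)
    parts.append(view[start:])          # trailing piece after the last separator
    return parts


pieces = split_view(b"a,bb,,ccc", b",")
as_bytes = [bytes(p) for p in pieces]   # copy only when a real bytes object is needed
```

Every returned view keeps the original object alive, which is exactly the trade-off discussed in the accepted answer.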

+2

glglgl Aug 4 '13 at 10:48

Bakuriu · Accepted Answer · 2013-08-04T11:30:45+0000

The underlying string representation is null-terminated, even though it also keeps track of the length, so you cannot have a string object that refers to a substring that is not a suffix. This already limits the usefulness of your proposal, because it would add many complications to handle suffixes and non-suffixes differently (and giving up null-terminated strings has other consequences).

Allowing references to substrings of a string would complicate garbage collection and string handling considerably. For every string you would have to keep track of how many objects refer to each character, or to each range of indices. That means complicating the struct of string objects and every operation that deals with them, which probably means a big slowdown.

Add the fact that, starting with Python 3, strings have three different internal representations, and things would become too messy to maintain; your proposal probably does not offer enough benefit to be accepted.
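The different internal representations can be observed indirectly from Python code. The sketch below assumes CPython 3.3 or later, where PEP 393 stores strings with 1, 2 or 4 bytes per character depending on the widest code point; other implementations may behave differently, so no concrete sizes are shown.

```python
import sys

# Assumption: CPython 3.3+ (PEP 393). The storage width per character
# depends on the widest code point that occurs in the string.
n = 1000
ascii_size = sys.getsizeof("a" * n)             # fits in 1 byte per character
bmp_size = sys.getsizeof("\u0416" * n)          # Cyrillic letter, 2 bytes per character
astral_size = sys.getsizeof("\U0001F600" * n)   # emoji, 4 bytes per character

for label, size in [("1-byte", ascii_size), ("2-byte", bmp_size), ("4-byte", astral_size)]:
    print(label, size)
```

A substring that shared storage with its parent would have to inherit the parent's width, even if its own characters would fit in a narrower representation.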

Another problem with this kind of "optimization" appears when you want to deallocate large strings:

    a = "Some string" * 10 ** 7
    b = a[10000]
    del a

After these operations the substring b prevents a, a huge string, from being deallocated. You could of course copy small substrings, but what if b = a[:10000] (or another large number)? 10,000 characters looks like a big string that ought to use the optimization to avoid copying, yet it would prevent megabytes of data from being released. The garbage collector would have to keep checking whether it is worth deallocating the big string object and making copies, and all of these operations must be as fast as possible; otherwise you end up hurting time performance.
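Since str slices copy, this scenario is hypothetical for strings, but it can be reproduced with objects that really do share memory. A minimal Python 3 sketch using bytes and memoryview (the names and the reduced size are chosen for the illustration):

```python
# Sketch with bytes and memoryview, which share memory instead of copying.
big = b"Some string" * 10**5        # about 1.1 MB
small_view = memoryview(big)[:10]   # zero-copy view of the first 10 bytes

del big                             # the name is gone, the buffer is not
kept_alive = len(small_view.obj)    # the view still references the whole buffer

small_copy = bytes(small_view)      # an independent 10-byte copy
small_view.release()                # drop the view so the large buffer can be freed
```

Here the programmer decides explicitly when to copy and when to release; with shared string slices that decision would fall to the interpreter.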

99% of the time the strings used in programs are "small" (at most 10k characters), so copying is very fast. The optimization you propose only starts to pay off with really large strings (for example, taking substrings of 100k characters from huge texts) and is much slower with really small strings, which is the common case, that is, the case that should be optimized.
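The cost of copying at different sizes is easy to measure rather than guess. The helper below is a hypothetical sketch built on the standard timeit module; the numbers depend on the machine and interpreter, so none are shown here.

```python
import timeit


def time_slice(length, repeat=10_000):
    """Return the seconds spent taking a `length`-character slice `repeat` times."""
    s = "x" * (length * 2)
    return timeit.timeit(lambda: s[:length], number=repeat)


for length in (10, 1_000, 100_000):
    print(length, time_slice(length, repeat=1_000))
```

Comparing the per-slice time for small and large lengths shows how much of the cost is fixed overhead and how much is the copy itself.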

If you think this is important, you are free to propose a PEP, show an implementation, and show the resulting changes in speed and memory usage. If it is really worth the effort, it may be included in a future version of Python.
