Why are strings copied in .NET?

Since strings are immutable in .NET, why are they copied for simple operations like Substring or Split? For example, by storing a char[] value, an int start, and an int length, a substring could simply point into an existing string, which would save the overhead of copying for many simple operations. So I wonder: why was it decided to copy strings for such operations?
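
To make the idea concrete, here is a minimal sketch of such a shared-storage substring. It is written in Java purely for illustration; the class name SharedSubstring and its methods are hypothetical and are not how System.String or java.lang.String is actually declared.

```java
// Hypothetical illustration of the "char[] value, int start, int length" idea.
final class SharedSubstring {
    private final char[] value;  // backing array, shared between parent and substrings
    private final int start;     // offset of the first character in the backing array
    private final int length;    // number of characters in this view

    SharedSubstring(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedSubstring(char[] value, int start, int length) {
        this.value = value;
        this.start = start;
        this.length = length;
    }

    // O(1): no characters are copied, only the offsets change.
    SharedSubstring substring(int begin, int end) {
        if (begin < 0 || end > length || begin > end) {
            throw new IndexOutOfBoundsException();
        }
        return new SharedSubstring(value, start + begin, end - begin);
    }

    // True when both views point at the same backing array.
    boolean sharesStorageWith(SharedSubstring other) {
        return value == other.value;
    }

    int length() {
        return length;
    }

    @Override
    public String toString() {
        return new String(value, start, length);
    }
}
```

Note that every substring created this way keeps the whole backing array reachable, however short the substring is.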

For example, was this done to support the current implementation of StringBuilder? Or to avoid keeping a reference to a large char[] when only a few characters are needed? Or for some other reason? Can you give the pros and cons of such a design?

As @cletus mentioned, and @Jon Skeet supported, this is more like asking why .NET strings were built differently from Java's in this respect.

+4
6 answers

What you describe is basically the way Java works. There are several advantages to the .NET approach, IMO:

  • Locality of reference - the data and the length are in the same place
  • Less indirection - the data is at a fixed offset inside the string object itself; there is no need to dereference a separate char array
  • No aliasing problem when you take a one-character substring of an originally large string, as Renault mentioned.
  • Fewer objects and fields overall. For a .NET string (ignoring any spare buffer space), the total size (on x86) is approximately 20+2*n bytes. In Java, you have the size of the char array (12 + 2*n bytes) plus the string object itself (24 bytes: object overhead, reference, start and count, and the cached hash code if it has ever been computed). So for an empty string, the .NET version takes about 20 bytes compared with 36 in Java. Of course, that is the worst case, and it is only a constant difference, but if you use a lot of independent strings it could eventually become significant. There is more for the garbage collector to look at, too.
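
The size arithmetic in the last point can be written out as a small sketch. The constants are simply the approximate x86 figures quoted above, not measured values, and the class and method names are made up for the example.

```java
// Rough per-string memory estimates, using the approximate x86 figures from the answer.
final class StringSizeEstimate {
    // .NET: one object holding header, length and the characters inline.
    static int dotNetBytes(int n) {
        return 20 + 2 * n;
    }

    // Java (shared-array design): a 24-byte String object plus a separate char[] of 12 + 2*n bytes.
    static int javaBytes(int n) {
        return 24 + (12 + 2 * n);
    }
}
```

With these figures, an empty string comes to 20 bytes versus 36, and the difference stays at 16 bytes for any length n.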

Of course, the advantage of the shared, Java-style approach is that it needs less space when none of the above comes into play.

In the end, it depends on your usage: the compiler and runtime cannot predict which usage pattern is more likely in your exact code.

The current string representation may also have interop benefits, but I don't know enough about that to say for sure.

EDIT: I don't know why your question received so many hostile answers. It is certainly not a "dumb" way of representing a string, and it clearly works. I believe that the fear of data loss and complexity is just FUD in this case - the Java string implementation is simple and reliable. I personally suspect that the .NET way of doing things is more efficient in most programs, and I suspect that MS has done research to verify this, but there will certainly be situations where the "shared" model works better.

+10

If you reused the same string to return substrings, what happens when the original string goes out of scope?

In the best case, it would have to stay in memory and could not be collected until all of its substrings had been released, so you would end up using more memory.

This is just one of the problems.

In fact, the garbage collector would have a couple of options:

  • Keep the entire source string in memory, even if only a very short substring is still in use.

  • Release the parts of the original string that are no longer referenced, and keep the substring where it is. This would create a lot of fragmentation, which means the garbage collector would probably have to move the strings at some point: we would still end up making a copy.

I am sure that sharing has its use cases, and sometimes it can be more efficient when working with substrings (say, when processing large XML documents).
However, as Jon said, Java string objects require more space, so if you have many small strings, they can actually use more memory than the .NET approach.

It is a trade-off.
I think that if you are in a situation where it really matters how memory is managed and you need absolutely predictable behavior, neither Java nor .NET will be the best tool.

We use garbage collectors because they are optimized to work efficiently in the vast majority of cases.
Knowing how they work is important, but whether or not they reuse strings is more of an optimization left to the underlying platform, and it should not leak too much to the surface. The GC, after all, is here to help us.

+5

In your substring example, this would mean re-running the substring logic every time we access the "new" string. That overhead alone makes it pretty obvious why the strings are copied.

+1

I think the key is the difference between:

  • #1: an immutable string
  • #2: an immutable string that exists for all eternity

What you describe would work if strings were #2. However, although strings are immutable, they can still be destroyed.

As you can see below, each has its own cost:

  • #1: an immutable string - always copy, as you mentioned
  • #2: an immutable string that exists for all eternity - keep every string ever created, forever

It's easy to see why #1 would be better :)

(But I don't mean that #2 is bad or dumb)

0

Believe me, you would hate it if strings were not immutable. To give you an example from Java: java.util.Date is mutable, and it is a nightmare. It basically forces everyone who receives a Date as a parameter, or returns one from a function, to copy it.

I can't speak for .NET strings, but the Java substring operation actually refers to the original string; that is, each string in Java has approximately 16-20 bytes of overhead (a pointer to a string, a start index, an end index, a length, and possibly something else). This has both advantages and disadvantages. It can be a real "gotcha" in terms of memory consumption. In one project I worked on, we were using massive amounts of memory. It turned out that we were receiving large messages (thousands of characters) and processing them with substrings. Because the substrings kept a reference to the source string, the source string was never collected.

Now, you can get around this by using the String constructor, but this is not obvious, and many people do not know about it.
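
A minimal sketch of that workaround, assuming a JVM whose substring shares the parent's character array as described above; the helper name detachedSubstring is made up for the example. On later JVMs where substring already copies, the extra constructor call is redundant but harmless.

```java
final class SubstringCopy {
    // Wraps the substring in new String(...) so the result is a separate String
    // object; on a JVM with shared backing arrays this was the usual idiom for
    // letting the large source string be collected.
    static String detachedSubstring(String source, int begin, int end) {
        return new String(source.substring(begin, end));
    }
}
```

For example, SubstringCopy.detachedSubstring(message, 0, 5) would keep only the first five characters of a large message alive.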

Basically, shared substrings of the kind you describe are a real can of worms. Be careful what you wish for.

0

If the string object held a reference to the character data, most strings would be two objects instead of one.

0