The fastest way to break huge text into smaller pieces

I used the code below to split the string, but it takes a lot of time.

    using (StreamReader srSegmentData = new StreamReader(fileNamePath))
    {
        string strSegmentData = "";
        string line = srSegmentData.ReadToEnd();
        int startPos = 0;
        ArrayList alSegments = new ArrayList();
        while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
        {
            strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
            alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
            startPos = startPos + segmentSize;
        }
    }

Please suggest an alternative way to split a string into smaller pieces of a fixed size.

2 answers

First of all, you must define what you mean by the size of a chunk. If you mean a fixed number of code units, then your current algorithm may be slow but it works. If that is not what you intend, and you actually mean a fixed number of characters, then it is broken. I discussed a similar problem in this Code Review answer: Split a string into chunks of the same length, so here I will only repeat the relevant parts.

  • You are partitioning by Char, but String is encoded in UTF-16, so you can produce broken strings in at least three cases:

    • One character is encoded as more than one code unit. The Unicode code point for such a character is encoded as two UTF-16 code units (a surrogate pair), and each code unit may end up in a different slice (and both resulting strings will be invalid).
    • One character is made of more than one code point. You may be splitting apart a character built from two separate Unicode code points (it can happen, for example, with Han characters).
    • One character has combining characters or modifiers. This is more common than you may think: for example, the Unicode combining character U+0300 COMBINING GRAVE ACCENT used to create à, and Unicode modifiers such as U+02BC MODIFIER LETTER APOSTROPHE.
  • The definition of character for a programming language and for a human are quite different. For example, in Slovak dž is a single character, however it is made of 2 (or 3) Unicode code points, which in this case are also 2 (or 3) UTF-16 code units, so "dž".Length > 1. More about this and other culture issues in How can I perform a Unicode aware character by character comparison?.
  • There are ligatures. Assuming one ligature is one code point (and also assuming it is encoded as one code unit), you will treat it as a single character, but it represents two characters. What to do in this case? In general the definition of character can be pretty vague, because it has a different meaning according to the discipline in which the word is used. You cannot (probably) handle everything correctly, but you must set some constraints and document the code's behavior.
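A minimal sketch (not from the original answer; the sample strings are arbitrary) that makes the first and third cases above concrete:

```csharp
using System;
using System.Globalization;

class UnicodeSplitPitfalls
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF: one code point, two UTF-16 code units.
        // Splitting between indices 0 and 1 would produce two invalid strings.
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length);                   // 2
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True

        // 'a' + U+0300 COMBINING GRAVE ACCENT renders as a single "à",
        // but a Char-based split could separate the accent from its base letter.
        string aGrave = "a\u0300";
        Console.WriteLine(aGrave.Length);                               // 2
        Console.WriteLine(new StringInfo(aGrave).LengthInTextElements); // 1
    }
}
```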

One possible (and untested) implementation may be this:

    public static IEnumerable<string> Split(this string value, int desiredLength)
    {
        var characters = StringInfo.GetTextElementEnumerator(value);
        while (characters.MoveNext())
            yield return String.Concat(Take(characters, desiredLength));
    }

    private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
    {
        for (int i = 0; i < count; ++i)
        {
            yield return (string)enumerator.Current;

            // Advance only between elements of the same chunk; the outer
            // loop's MoveNext fetches the first element of the next chunk.
            if (i < count - 1 && !enumerator.MoveNext())
                yield break;
        }
    }

It is not optimized for speed (as you can see, I tried to keep the code short and clear using enumerators), but for large files it still performs better than your implementation (see the next paragraph for the reason).
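To show the extension in use, here is a self-contained version (the chunk boundary is handled between Take and the outer loop; class, namespace usage, and sample strings are mine, not from the original answer):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class StringExtensions
{
    public static IEnumerable<string> Split(this string value, int desiredLength)
    {
        var characters = StringInfo.GetTextElementEnumerator(value);
        while (characters.MoveNext())
            yield return string.Concat(Take(characters, desiredLength));
    }

    private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
    {
        for (int i = 0; i < count; ++i)
        {
            yield return (string)enumerator.Current;
            // Advance only between elements of the same chunk; the outer
            // loop's MoveNext fetches the first element of the next chunk.
            if (i < count - 1 && !enumerator.MoveNext())
                yield break;
        }
    }
}

class Demo
{
    static void Main()
    {
        // "e" + U+0301 COMBINING ACUTE ACCENT stays together as one text element.
        foreach (var chunk in "He\u0301llo".Split(2))
            Console.WriteLine(chunk); // "He\u0301", "ll", "o"
    }
}
```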

About your code, note that:

  • You create a huge ArrayList (?!) to hold the result. Also note that this way you resize the ArrayList many times (even though, given the input size and the chunk size, its final size is known in advance).
  • strSegmentData is rebuilt many times; if you need to accumulate characters you must use StringBuilder, otherwise each operation will allocate a new string and copy the old value (it is slow, and it also adds pressure on the garbage collector).
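For the second point, a minimal sketch of the same accumulation done with StringBuilder (the method and class names are illustrative, only the variable roles mirror the question's code):

```csharp
using System;
using System.Text;

class StringBuilderDemo
{
    // Same accumulation as strSegmentData in the question, but with StringBuilder:
    // string + string allocates a new, ever larger string per iteration (O(n^2)
    // copying overall), while StringBuilder appends into a growable buffer.
    public static string BuildSegments(string line, int segmentSize)
    {
        var sb = new StringBuilder(line.Length + (line.Length / segmentSize) * Environment.NewLine.Length);
        for (int startPos = 0; startPos + segmentSize <= line.Length; startPos += segmentSize)
        {
            sb.Append(line, startPos, segmentSize); // no intermediate Substring allocation
            sb.AppendLine();
        }
        return sb.ToString();
    }

    static void Main()
    {
        // 100 chars in 10-char segments, each followed by a newline.
        Console.WriteLine(BuildSegments(new string('x', 100), 10).Length);
    }
}
```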

There are faster implementations (see the linked Code Review post, especially Heslacher's implementation, for a much faster version), and if you do not need to handle Unicode correctly (when you are sure you are managing US ASCII characters only) there is also a readable one (note that, after profiling your code, you may still improve performance for large files by pre-allocating the output list with the right size). I will not repeat their code here, please refer to the linked posts.
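For the ASCII-only case, a sketch of the pre-allocation idea (names are mine, not taken from the linked posts):

```csharp
using System;
using System.Collections.Generic;

class PreallocatedSplit
{
    // Safe only when every "character" is a single UTF-16 code unit (e.g. US ASCII).
    public static List<string> SplitAscii(string value, int segmentSize)
    {
        // The chunk count is known up front, so the list never has to resize.
        var result = new List<string>((value.Length + segmentSize - 1) / segmentSize);
        for (int i = 0; i < value.Length; i += segmentSize)
            result.Add(value.Substring(i, Math.Min(segmentSize, value.Length - i)));
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join("|", SplitAscii("abcdefg", 3))); // abc|def|g
    }
}
```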

In your specific case you do not need to read the whole huge file into memory: you can read/parse n characters at a time (do not worry too much about disk access, I/O is buffered). It will slightly degrade performance but it will greatly improve memory usage. Alternatively, you can read line by line (managing the cross-line handling).


Below is my analysis of your question and code (read the comments)

    using (StreamReader srSegmentData = new StreamReader(fileNamePath))
    {
        string strSegmentData = "";
        string line = srSegmentData.ReadToEnd(); // Why are you reading this till the end if it is such a long string?
        int startPos = 0;
        ArrayList alSegments = new ArrayList();  // A better choice would be List<string>
        while (startPos < line.Length && (line.Length - startPos) >= segmentSize)
        {
            // It seems you are inserting line breaks at a fixed interval in your original string. Is that what you want?
            strSegmentData = strSegmentData + line.Substring(startPos, segmentSize) + Environment.NewLine;
            // Why recalculate the Substring? And why append the newline if the aim is just to "split"?
            alSegments.Add(line.Substring(startPos, segmentSize) + Environment.NewLine);
            startPos = startPos + segmentSize;
        }
    }

Having made those assumptions, below is the code that I would recommend for splitting a long string. It is just a cleaner way to do what you do in your sample. You can optimize it further, but I don't know how much speed you are looking for.

    static void Main(string[] args)
    {
        string fileNamePath = "ConsoleApplication1.pdb";
        var segmentSize = 32;
        var op = ReadSplit(fileNamePath, segmentSize);
        var joinedString = string.Join(Environment.NewLine, op);
    }

    static List<string> ReadSplit(string filePath, int segmentSize)
    {
        var splitOutput = new List<string>();
        using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024))
        {
            char[] buffer = new char[segmentSize];
            while (!file.EndOfStream)
            {
                int n = file.ReadBlock(buffer, 0, segmentSize);
                splitOutput.Add(new string(buffer, 0, n));
            }
        }
        return splitOutput;
    }

I have not run performance tests on my version, but I think it is faster than yours.

Also, I'm not sure how you plan to use the output, but a good optimization when doing I/O is to use asynchronous calls. And a good optimization (at the cost of readability and complexity) when processing a large string is to stick with char[].
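As a sketch of the async suggestion (the method name ReadSplitAsync and the temp-file demo are mine; error handling omitted):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;

class AsyncSplitDemo
{
    // Same chunking as ReadSplit above, but awaiting the reads so the calling
    // thread is free while the runtime fills the buffer.
    public static async Task<List<string>> ReadSplitAsync(string filePath, int segmentSize)
    {
        var splitOutput = new List<string>();
        using (var file = new StreamReader(filePath, Encoding.UTF8, true, 8 * 1024))
        {
            char[] buffer = new char[segmentSize];
            int n;
            while ((n = await file.ReadBlockAsync(buffer, 0, segmentSize)) > 0)
                splitOutput.Add(new string(buffer, 0, n));
        }
        return splitOutput;
    }

    static async Task Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, new string('a', 70));
        var chunks = await ReadSplitAsync(path, 32);
        Console.WriteLine(chunks.Count); // 32 + 32 + 6 characters -> 3 chunks
        File.Delete(path);
    }
}
```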

Note that:

  • You may have to deal with character encoding issues while reading the file.
  • If you already have the long string in memory, and reading from a file was just for the demo, then you should use the StringReader class instead of StreamReader.
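A minimal sketch of the StringReader variant (assuming the string is already in memory; names are mine):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class StringReaderSplit
{
    public static List<string> SplitInMemory(string longText, int segmentSize)
    {
        var chunks = new List<string>();
        // StringReader walks an in-memory string through the same TextReader
        // API, so no file access or encoding handling is involved.
        using (var reader = new StringReader(longText))
        {
            char[] buffer = new char[segmentSize];
            int n;
            while ((n = reader.ReadBlock(buffer, 0, segmentSize)) > 0)
                chunks.Add(new string(buffer, 0, n));
        }
        return chunks;
    }

    static void Main()
    {
        Console.WriteLine(SplitInMemory(new string('x', 70), 32).Count); // 3
    }
}
```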
