First of all, you must determine what you mean by the size of a chunk. If you mean chunks with a fixed number of code units, then your current algorithm may be slow, but it works. If that is not what you intend, and you actually mean chunks with a fixed number of characters, then it will break. I discussed a similar problem in this Code Review answer: Split a string into pieces of the same length, so here I will only repeat the relevant parts.
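To illustrate the difference, a minimal sketch (the string content is chosen just for the example): a single user-perceived character can span several UTF-16 code units, so a split by code units can cut it in half.

```csharp
using System;
using System.Globalization;

class CodeUnitsVsCharacters
{
    static void Main()
    {
        // "é" written as 'e' + U+0301 (combining acute accent):
        // two UTF-16 code units, but one user-perceived character.
        string text = "e\u0301";

        Console.WriteLine(text.Length);                               // 2 (code units)
        Console.WriteLine(new StringInfo(text).LengthInTextElements); // 1 (text element)
    }
}
```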
A proposed (and untested) implementation might be this:
```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class StringExtensions
{
    public static IEnumerable<string> Split(this string value, int desiredLength)
    {
        var characters = StringInfo.GetTextElementEnumerator(value);
        while (characters.MoveNext())
            yield return String.Concat(Take(characters, desiredLength));
    }

    private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
    {
        for (int i = 0; i < count; ++i)
        {
            yield return (string)enumerator.Current;

            // Do not advance past the last element of this chunk: the caller's
            // MoveNext() will step to the next one; advancing here as well
            // would skip one text element between chunks.
            if (i < count - 1 && !enumerator.MoveNext())
                yield break;
        }
    }
}
```
It is not optimized for speed (as you can see, I tried to keep the code short and clear using enumerators), but for large files it still performs better than your implementation (see the next paragraph for the reason).
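For example, assuming the extension method above is in scope, a quick usage sketch:

```csharp
// "ne\u0301e" is "née" with a combining accent (3 text elements, 4 code units).
// A code-unit split of length 2 would detach the accent from its base letter;
// this split keeps "e\u0301" together.
foreach (var chunk in "ne\u0301e".Split(2))
    Console.WriteLine(chunk);   // prints "né", then "e"
```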
About your code, note that:
- You create a huge ArrayList (?!) to store the result. Note also that this way the ArrayList is resized many times (even though, given the input size and the chunk size, its final size is known in advance).
- strSegmentData is rebuilt many times; if you need to accumulate characters, you should use a StringBuilder, otherwise each operation allocates a new string and copies the old value (this is slow and also adds pressure on the garbage collector). See the sketch after this list.
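A minimal sketch of both fixes together, with hypothetical names (the role of strSegmentData is played by the StringBuilder; note this still splits by char, i.e. by code units):

```csharp
using System.Collections.Generic;
using System.Text;

static class ChunkingNotes
{
    // Illustrative only: 'input' and 'chunkSize' stand in for the original
    // code's data; the point is the allocation pattern.
    static List<string> BuildChunks(string input, int chunkSize)
    {
        // The final count is known from the input length and chunk size,
        // so the list never needs to be resized.
        var chunks = new List<string>((input.Length + chunkSize - 1) / chunkSize);

        var sb = new StringBuilder(chunkSize);   // reused buffer, no per-append copies
        foreach (char c in input)
        {
            sb.Append(c);
            if (sb.Length == chunkSize)
            {
                chunks.Add(sb.ToString());
                sb.Clear();
            }
        }
        if (sb.Length > 0)
            chunks.Add(sb.ToString());           // trailing partial chunk
        return chunks;
    }
}
```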
There are faster implementations (see the linked Code Review post, especially Heslacher's implementation for the fastest version), and if you do not need to handle Unicode correctly (you are sure you only deal with US-ASCII characters), then there is also a more readable one (also note that, after profiling your code, you can still improve performance for large files by preallocating the output list to the right size). I will not repeat their code here; please refer to the linked posts.
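For reference, a minimal sketch of the general shape of such a code-unit split (this is not the Code Review code, and it is only safe when every character is a single code unit, e.g. US-ASCII):

```csharp
using System;
using System.Collections.Generic;

static class AsciiSplit
{
    // Splits by UTF-16 code units; graphemes spanning several code units
    // (combining marks, surrogate pairs) would be cut in half.
    public static IEnumerable<string> SplitBySize(string value, int chunkSize)
    {
        for (int i = 0; i < value.Length; i += chunkSize)
            yield return value.Substring(i, Math.Min(chunkSize, value.Length - i));
    }
}
```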
In your particular case, you do not need to read the entire huge file into memory: you can read/parse n characters at a time (do not worry about disk access; I/O is buffered). This will slightly degrade performance, but it will greatly improve memory usage. Alternatively, you can read line by line (handling chunks that span lines).
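A hedged sketch of the streaming approach (the file name and buffer size are placeholders):

```csharp
using System;
using System.IO;

static class StreamingSplit
{
    static void Main()
    {
        const int chunkSize = 4096;            // placeholder size
        var buffer = new char[chunkSize];

        // StreamReader buffers disk I/O, so reading in small pieces
        // does not mean one disk access per call.
        using (var reader = new StreamReader("huge-file.txt"))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                string chunk = new string(buffer, 0, read);
                // Process 'chunk' here instead of holding the whole file.
                // Caveat: a fixed-size char buffer can still end mid
                // surrogate pair, so boundaries need care with non-ASCII input.
            }
        }
    }
}
```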
Adriano Repetti