Reading large text files with streams in C#

I have the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product, for quick macros). Most files are around 300-400 KB, which loads fine. But when they go beyond 100 MB the process has a hard time (as you'd expect).

What happens is that the file is read and shoved into a RichTextBox, which is then navigated - don't worry too much about this part.

The developer who wrote the original code simply uses a StreamReader and calls

[Reader].ReadToEnd() 

which can take quite a while.

My task is to break this bit of code up, read the file in chunks into a buffer, and show a progress bar with an option to cancel it.

Some assumptions:

  • Most files will be 30-40 MB
  • The contents of the file are textual (not binary); some files are Unix-formatted, some DOS.
  • Once we have the contents, we'll work out which line terminator is used.
  • No one is worried, once it's loaded, about the time it takes to render in the RichTextBox. It's just the initial load of the text.

Now for the questions:

  • Can I simply use a StreamReader, then check the Length property (for ProgressMax), issue a Read for a set buffer size, and iterate through it in a while loop inside a background worker so it doesn't block the main UI thread? Then return the StringBuilder to the main thread once it has completed.
  • The contents will go into a StringBuilder. Can I initialize the StringBuilder with the size of the stream if the length is available?

Are these (in your professional opinions) good ideas? I've had a few issues in the past with reading content from Streams, because they would always miss the last few bytes or something like that, but I'll ask another question if that's the case.

+68
c# stream large-files streamreader
Jan 29 '10
10 answers

You can improve read speed with BufferedStream, for example:

    using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // process the line here
        }
    }

March 2013 UPDATE

I recently wrote code to read and process (search for text in) 1 GB-ish text files (much larger than the files in question here) and achieved a significant performance boost by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.

I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding up this pattern.
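A rough sketch of that producer/consumer arrangement, assuming the TPL Dataflow package (System.Threading.Tasks.Dataflow); the file path and search term are placeholder values, not the originals:

    using System;
    using System.IO;
    using System.Threading.Tasks.Dataflow;

    class SearchExample
    {
        static void Main()
        {
            string path = @"C:\Temp\big.txt"; // placeholder path
            string searchTerm = "needle";     // placeholder term

            // Consumer: searches each line handed to it, on its own task.
            var consumer = new ActionBlock<string>(line =>
            {
                if (line.Contains(searchTerm))
                    Console.WriteLine(line);
            });

            // Producer: reads lines through a BufferedStream and posts them.
            using (var fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            using (var bs = new BufferedStream(fs))
            using (var sr = new StreamReader(bs))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                    consumer.Post(line);
            }

            consumer.Complete();        // signal no more input
            consumer.Completion.Wait(); // wait for the search to drain
        }
    }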

Why BufferedStream is faster

A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer.

December 2014 UPDATE: Your mileage may vary.

Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first given, I measured a significant performance boost by adding a BufferedStream. At that time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.
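For what it's worth, FileStream's constructor lets you size that internal buffer directly; a small sketch, where the 64 KB value is an arbitrary example rather than a measured optimum:

    // FileStream buffers internally; the constructor takes an explicit buffer size.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.ReadWrite, 65536)) // 64 KB internal buffer
    using (var sr = new StreamReader(fs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            // process the line
        }
    }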

Related

I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.NET MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. See Unbuffered Output Very Slow for more details.

+137
Mar 10 '12 at 1:22

You say that you have been asked to show a progress bar while a large file is loading. Is that because users really want to see the exact percentage of the file load, or simply because they want visual feedback that something is happening?

If it's the latter, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.

I raise this point because, in my experience, this is often the case. When you are writing a data-processing program, users will certainly be interested in a percent-complete figure, but for a simple-but-slow UI update they more likely just want to know that the computer hasn't crashed. :-)
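A minimal sketch of that approach, assuming WinForms with a BackgroundWorker; progressBar1 and richTextBox1 are hypothetical control names, not from the original post:

    // Indeterminate bar: shows activity without claiming a percentage.
    progressBar1.Style = ProgressBarStyle.Marquee;

    var worker = new BackgroundWorker();
    worker.DoWork += (s, e) =>
    {
        // The slow part runs off the UI thread.
        using (var reader = new StreamReader((string)e.Argument))
            e.Result = reader.ReadToEnd();
    };
    worker.RunWorkerCompleted += (s, e) =>
    {
        richTextBox1.Text = (string)e.Result;
        progressBar1.Style = ProgressBarStyle.Blocks; // back to normal
    };
    worker.RunWorkerAsync(filename);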

+14
Jan 29 '10 at 13:03

If you read the performance and benchmark stats on this website, you'll see that the fastest way to read a text file (because reading, writing, and processing are all different) is the following snippet of code:

    using (StreamReader sr = File.OpenText(fileName))
    {
        string s = String.Empty;
        while ((s = sr.ReadLine()) != null)
        {
            // do your stuff here
        }
    }

In all, about 9 different methods were benchmarked, but this one comes out ahead most of the time, even beating the buffered reader, as other readers have mentioned.

+12
Sep 19 '14 at 14:21

For binary files, this is the fastest way of reading them that I have found:

    using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(file))
    using (MemoryMappedViewStream mms = mmf.CreateViewStream())
    using (BinaryReader b = new BinaryReader(mms))
    {
        // read the data here, e.g. with b.ReadBytes(...)
    }

In my tests, it is hundreds of times faster.

+7
Sep 30 '14 at 12:38

Use a background worker and read only a limited number of lines. Read more only when the user scrolls.
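A minimal sketch of that lazy approach; the class and method names are hypothetical:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class LazyLineLoader : IDisposable
    {
        private readonly StreamReader _reader;

        public LazyLineLoader(string path)
        {
            _reader = new StreamReader(path);
        }

        // Returns up to 'count' lines; call again when the user scrolls near the end.
        public List<string> ReadNextLines(int count)
        {
            var lines = new List<string>(count);
            string line;
            while (lines.Count < count && (line = _reader.ReadLine()) != null)
                lines.Add(line);
            return lines;
        }

        public void Dispose()
        {
            _reader.Dispose();
        }
    }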

And try never to use ReadToEnd(). It's one of those methods that make you think "why did they ever write this?"; it's a script-kiddie helper that works fine for small things, but as you can see, it sucks for large files...

Those telling you to use a StringBuilder should read MSDN more often:

Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer. The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.

A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.

That means a huge memory allocation, which in turn becomes heavy use of the swap file system, which emulates sections of your hard disk drive as if they were RAM, and a hard disk drive is very slow.

The StringBuilder option is fine if you use the system as a single user, but when you have two or more users reading large files at the same time, you have a problem.
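As an aside, if you do use a StringBuilder, pre-sizing it from the file length (as the question asks about) avoids the grow-and-copy reallocations described in the MSDN excerpt above. A small sketch, where path is a placeholder:

    // Sketch: pre-size the StringBuilder from the file's byte length,
    // which is an upper bound on the char count for single-byte encodings.
    long length = new FileInfo(path).Length; // System.IO
    var sb = new StringBuilder((int)Math.Min(length, int.MaxValue));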

+6
Jan 29

That should be enough to get you started.

    class Program
    {
        static void Main(String[] args)
        {
            const int bufferSize = 1024;
            var sb = new StringBuilder();
            var buffer = new Char[bufferSize];
            var length = 0L;    // total length of the underlying stream, for ProgressMax
            var totalRead = 0L; // characters read so far, to drive progress updates
            var count = bufferSize;

            using (var sr = new StreamReader(@"C:\Temp\file.txt"))
            {
                length = sr.BaseStream.Length;
                while (count > 0)
                {
                    count = sr.Read(buffer, 0, bufferSize);
                    sb.Append(buffer, 0, count);
                    totalRead += count;
                }
            }

            Console.ReadKey();
        }
    }
+5
Jan 29

See the following code snippet. You mentioned that most files will be 30-40 MB. This claims to read 180 MB in 1.4 seconds on an Intel Quad Core:

    private int _bufferSize = 16384;

    private void ReadFile(string filename)
    {
        StringBuilder stringBuilder = new StringBuilder();
        FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read);

        using (StreamReader streamReader = new StreamReader(fileStream))
        {
            char[] fileContents = new char[_bufferSize];
            int charsRead = streamReader.Read(fileContents, 0, _bufferSize);

            // Can't do much with 0 bytes
            if (charsRead == 0)
                throw new Exception("File is 0 bytes");

            while (charsRead > 0)
            {
                // Append only the characters actually read; appending the whole
                // buffer would duplicate stale data on the final, partial read.
                stringBuilder.Append(fileContents, 0, charsRead);
                charsRead = streamReader.Read(fileContents, 0, _bufferSize);
            }
        }
    }

Original article

+4
Jan 29 '10 at 12:52

You might be better off using memory-mapped files here. Memory-mapped file support will be in .NET 4 (I think... I heard that through someone else talking about it), hence this wrapper, which uses p/invoke to do the same job.

Edit: see this entry on MSDN for how it works, and here's a blog entry covering how it will be done in the upcoming .NET 4 when it comes out as a release. The link I gave earlier is a wrapper around p/invoke to achieve this. You can map the entire file into memory and view it like a sliding window when scrolling through the file.
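A sketch of that sliding-window idea using the .NET 4 API; the path, offset, and window size are placeholders, and the window must lie within the file:

    using System;
    using System.IO.MemoryMappedFiles;
    using System.Text;

    class SlidingWindowExample
    {
        static void Main()
        {
            string path = @"C:\Temp\file.txt"; // placeholder path
            long offset = 0;                   // advance as the user scrolls
            int windowSize = 4096;             // bytes per view

            using (var mmf = MemoryMappedFile.CreateFromFile(path))
            using (var view = mmf.CreateViewStream(offset, windowSize))
            {
                var bytes = new byte[windowSize];
                int read = view.Read(bytes, 0, bytes.Length);
                // Note: a window boundary can split a multi-byte character.
                Console.WriteLine(Encoding.UTF8.GetString(bytes, 0, read));
            }
        }
    }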

+2
Jan 29 '10 at 12:52

An iterator may be ideal for this type of work:

    public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder stringData)
    {
        const int charBufferSize = 4096;
        using (FileStream fs = File.OpenRead(filename))
        {
            using (BinaryReader br = new BinaryReader(fs))
            {
                long length = fs.Length;
                int numberOfChunks = Convert.ToInt32((length / charBufferSize)) + 1;
                double iter = 100 / Convert.ToDouble(numberOfChunks);
                double currentIter = 0;
                yield return Convert.ToInt32(currentIter);

                while (true)
                {
                    char[] buffer = br.ReadChars(charBufferSize);
                    if (buffer.Length == 0) break;
                    stringData.Append(buffer);
                    currentIter += iter;
                    yield return Convert.ToInt32(currentIter);
                }
            }
        }
    }

You can call it using the following:

    string filename = "C:\\myfile.txt";
    StringBuilder sb = new StringBuilder();

    foreach (int progress in LoadFileWithProgress(filename, sb))
    {
        // Update your progress counter here!
    }

    string fileData = sb.ToString();

As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.

Also, because you want text, we can just use a BinaryReader to read in characters, which ensures that your buffers line up correctly when reading any multi-byte characters (UTF-8, UTF-16, etc.).

All this is done without the use of background tasks, threads, or complex custom state machines.

+1
Jul 09 '10 at 18:35

I know this question is pretty old, but I found it the other day and tested the MemoryMappedFile recommendation, and it is hands down the fastest method. For comparison: reading a 345 MB file of 7,616,939 lines via a readline method takes 12 hours on my machine, while performing the same load via MemoryMappedFile takes 3 seconds.

I wanted to post this as a comment on that answer, but my reputation is not high enough to do so. I wanted to draw attention to it because I scoured the Internet and tested every recommendation I could find before coming back around to the MemoryMappedFile suggestion.

0
Mar 13 '17 at 11:20


