How to write super-fast file-stream code in C#?

I need to split a huge file into several small files. Each destination file is defined by an offset and a length in bytes within the source file. I am using the following code:

    private void copy(string srcFile, string dstFile, int offset, int length)
    {
        BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
        reader.BaseStream.Seek(offset, SeekOrigin.Begin);
        byte[] buffer = reader.ReadBytes(length);

        BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
        writer.Write(buffer);
    }

Given that I need to call this function about 100,000 times, it is remarkably slow.


  1. Is there a way to connect the writer directly to the reader? (That is, without actually loading the contents into a buffer in memory.)
+38
performance c# cpu streaming
Jun 05 '09 at 13:41
9 answers

I don't believe there is anything in .NET that lets you copy a section of a file without buffering it in memory. However, it strikes me as inefficient anyway, since it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once and then just write something like:

    public static void CopySection(Stream input, string targetFile, int length)
    {
        byte[] buffer = new byte[8192];

        using (Stream output = File.OpenWrite(targetFile))
        {
            int bytesRead = 1;
            // This will finish silently if we couldn't read "length" bytes.
            // An alternative would be to throw an exception.
            while (length > 0 && bytesRead > 0)
            {
                bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
                output.Write(buffer, 0, bytesRead);
                length -= bytesRead;
            }
        }
    }

This has a minor inefficiency in creating a buffer on each invocation - you may want to create the buffer once and pass it into the method:

    public static void CopySection(Stream input, string targetFile, int length, byte[] buffer)
    {
        using (Stream output = File.OpenWrite(targetFile))
        {
            int bytesRead = 1;
            // This will finish silently if we couldn't read "length" bytes.
            // An alternative would be to throw an exception.
            while (length > 0 && bytesRead > 0)
            {
                bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
                output.Write(buffer, 0, bytesRead);
                length -= bytesRead;
            }
        }
    }

Note that this also closes the output stream (thanks to the using statement), which your original code did not do.

The important point is that this will use the operating system's file buffering more efficiently, because you reuse the same input stream instead of reopening the file at the start of each copy and then seeking.

I suspect it will be significantly faster, but obviously you'll need to try it to see...

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you are writing very small files, you may want to optimize for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
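
As a rough illustration of that last point (my own sketch, not part of the answer; the Section type, the section list and the 64 KB buffer size are all assumptions, and CopySection is the method above), the wrapping could look like this:

    using System.Collections.Generic;
    using System.IO;

    // Hypothetical description of one destination file.
    class Section
    {
        public string TargetFile;
        public long Offset;
        public int Length;
    }

    // Wrap the single shared input stream in a BufferedStream so that many small
    // sequential reads are served from memory rather than from the disk.
    static void SplitFile(string srcFile, IEnumerable<Section> sections)
    {
        byte[] buffer = new byte[8192];
        using (Stream raw = File.OpenRead(srcFile))
        using (Stream input = new BufferedStream(raw, 64 * 1024))
        {
            foreach (Section s in sections)   // assumed contiguous and in file order
            {
                CopySection(input, s.TargetFile, s.Length, buffer);
            }
        }
    }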

+44
Jun 05 '09 at 13:49

The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability, as well as a benchmarking program that compares different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:

http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
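
For reference, calling those Win32 functions from C# via P/Invoke looks roughly like the sketch below. This is my own illustration, not the class from the blog post; error handling and overlapped I/O are omitted, and the 64 KB buffer size is an arbitrary example.

    using System;
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class NativeCopy
    {
        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool ReadFile(SafeFileHandle hFile, byte[] lpBuffer,
            uint nNumberOfBytesToRead, out uint lpNumberOfBytesRead, IntPtr lpOverlapped);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool WriteFile(SafeFileHandle hFile, byte[] lpBuffer,
            uint nNumberOfBytesToWrite, out uint lpNumberOfBytesWritten, IntPtr lpOverlapped);

        public static void Copy(string srcFile, string dstFile)
        {
            byte[] buffer = new byte[64 * 1024];
            using (FileStream src = File.OpenRead(srcFile))
            using (FileStream dst = File.OpenWrite(dstFile))
            {
                uint read, written;
                while (ReadFile(src.SafeFileHandle, buffer, (uint)buffer.Length,
                                out read, IntPtr.Zero) && read > 0)
                {
                    WriteFile(dst.SafeFileHandle, buffer, read, out written, IntPtr.Zero);
                }
            }
        }
    }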

+20
Mar 03 '11

How large is length? You may do better to use a fixed-size (moderately large, but not obscene) buffer and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

    private static void copy(string srcFile, string dstFile, int offset, int length, byte[] buffer)
    {
        using (Stream inStream = File.OpenRead(srcFile))
        using (Stream outStream = File.OpenWrite(dstFile))
        {
            inStream.Seek(offset, SeekOrigin.Begin);
            int bufferLength = buffer.Length, bytesRead;

            while (length > bufferLength
                && (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
            {
                outStream.Write(buffer, 0, bytesRead);
                length -= bytesRead;
            }

            while (length > 0
                && (bytesRead = inStream.Read(buffer, 0, length)) > 0)
            {
                outStream.Write(buffer, 0, bytesRead);
                length -= bytesRead;
            }
        }
    }
+6
Jun 05 '09 at 13:48

You shouldn't re-open the source file every time you do a copy; it is better to open it once and pass the resulting BinaryReader to the copy function. It can also help to order your requests, so you don't make big jumps within the file.

If the lengths aren't too big, you can also try to group several copy calls by combining offsets that are near each other and reading the whole block you need. For example:

    offset = 1234, length = 34
    offset = 1300, length = 40
    offset = 1350, length = 1000

can be grouped into one read:

    offset = 1234, length = 1116

Then you only have to "seek" within that buffer and can write the three new files from it without having to read from the source again.
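
A minimal sketch of that grouping step (my own illustration, not code from the answer; it reuses the hypothetical Section class from the earlier sketch):

    using System;
    using System.IO;

    // Read one block that covers several nearby sections, then slice the
    // individual files out of the in-memory block.
    static void CopyGrouped(BinaryReader reader, long groupOffset, int groupLength, Section[] sections)
    {
        reader.BaseStream.Seek(groupOffset, SeekOrigin.Begin);
        byte[] block = reader.ReadBytes(groupLength);

        foreach (Section s in sections)
        {
            // "Seek" inside the block rather than inside the file.
            byte[] part = new byte[s.Length];
            Buffer.BlockCopy(block, (int)(s.Offset - groupOffset), part, 0, s.Length);
            File.WriteAllBytes(s.TargetFile, part);
        }
    }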

+3
Jun 05 '09 at 13:49

Have you considered using the CCR (Concurrency and Coordination Runtime)? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes this easy.

    static void Main(string[] args)
    {
        Dispatcher dp = new Dispatcher();
        DispatcherQueue dq = new DispatcherQueue("DQ", dp);

        Port<long> offsetPort = new Port<long>();

        Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
            new Handler<long>(Split)));

        // file_path and split_size are assumed to be fields defined elsewhere.
        FileStream fs = File.Open(file_path, FileMode.Open);
        long size = fs.Length;
        fs.Dispose();

        for (long i = 0; i < size; i += split_size)
        {
            offsetPort.Post(i);
        }
    }

    private static void Split(long offset)
    {
        // Each invocation opens its own reader, so no synchronization is needed.
        FileStream reader = new FileStream(file_path, FileMode.Open, FileAccess.Read);
        reader.Seek(offset, SeekOrigin.Begin);

        long toRead = 0;
        if (offset + split_size <= reader.Length)
            toRead = split_size;
        else
            toRead = reader.Length - offset;

        byte[] buff = new byte[toRead];
        reader.Read(buff, 0, (int)toRead);
        reader.Dispose();

        File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
    }

This code posts the offsets to a CCR port, which causes a thread to be created to execute the code in the Split method. It does open the file multiple times, but it eliminates the need for synchronization. You can make it more memory efficient, but you would have to sacrifice speed.

+3
Jun 05 '09 at 14:57

The first thing I would recommend is to take measurements. Where is the time being spent? Is it in the reading or the writing?

Over 100,000 invocations (sum the times): how much time is spent allocating the buffer array? How much time is spent opening the file for reading (and is it the same file each time)? How much time is spent in the read and write operations?

If you aren't doing any transformation of the data, do you need a BinaryWriter, or can you use a plain stream for the writes? (Try it - do you get the same result? Does it save time?)
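
One way to collect those numbers (my own sketch, not part of the answer) is to accumulate per-phase timings with System.Diagnostics.Stopwatch across all calls:

    using System;
    using System.Diagnostics;
    using System.IO;

    static class CopyTimer
    {
        static readonly Stopwatch OpenTime = new Stopwatch();
        static readonly Stopwatch ReadTime = new Stopwatch();
        static readonly Stopwatch WriteTime = new Stopwatch();

        public static void Copy(string srcFile, string dstFile, int offset, int length)
        {
            OpenTime.Start();
            using (FileStream input = File.OpenRead(srcFile))
            using (FileStream output = File.OpenWrite(dstFile))
            {
                OpenTime.Stop();

                ReadTime.Start();
                input.Seek(offset, SeekOrigin.Begin);
                byte[] buffer = new byte[length];
                int read = input.Read(buffer, 0, length);
                ReadTime.Stop();

                WriteTime.Start();
                output.Write(buffer, 0, read);
                WriteTime.Stop();
            }
        }

        public static void Report()
        {
            Console.WriteLine("open: {0}  read: {1}  write: {2}",
                OpenTime.Elapsed, ReadTime.Elapsed, WriteTime.Elapsed);
        }
    }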

+1
Jun 05 '09 at 13:52

Using FileStream + StreamWriter, I know you can create massive files in little time (less than 1 minute 30 seconds). I generate three files totalling 700+ megabytes from a single source file using that technique.

The fundamental problem with the code you are using is that you open a file every time, which creates file I/O overhead.

If you know the names of the files you are going to generate ahead of time, you could extract File.OpenWrite into a separate method; that will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster than that.

+1
Jun 05 '09 at 15:31

Has no one suggested threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files; that way you can create them all in parallel, and you don't need to wait for each one to complete. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course, you should first verify that a sequential approach is not adequate.
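
A minimal sketch of that idea (my own illustration, not code from the answer; it reuses the hypothetical Section class from above, and with ~100,000 files you would want to cap the number of concurrent threads rather than start one per file):

    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    // One worker thread per output file; each worker opens its own handle on the
    // source file, so the workers do not need to synchronize with each other.
    static void SplitInParallel(string srcFile, List<Section> sections)
    {
        var threads = new List<Thread>();
        foreach (Section s in sections)
        {
            Section local = s;                    // avoid capturing the loop variable
            var t = new Thread(() =>
            {
                using (FileStream input = File.OpenRead(srcFile))
                using (FileStream output = File.OpenWrite(local.TargetFile))
                {
                    input.Seek(local.Offset, SeekOrigin.Begin);
                    byte[] buffer = new byte[local.Length];
                    int read = input.Read(buffer, 0, local.Length);
                    output.Write(buffer, 0, read);
                }
            });
            threads.Add(t);
            t.Start();
        }
        threads.ForEach(t => t.Join());
    }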

0
Jun 05 '09 at 14:21

(for future reference)

The fastest way to do this is almost certainly to use memory-mapped files: you are then essentially copying memory, with the OS handling the file reads and writes through its paging/memory management.

Memory-mapped files are supported in managed code from .NET 4.0 onwards.

But, as already noted, you need to profile, and you should expect to have to switch to native code for maximum performance.
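
For illustration (my own sketch against the .NET 4.0 System.IO.MemoryMappedFiles API, not code from the answer; Section is again the hypothetical class from above):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Map the source file once, then copy each section out of a read-only view stream.
    static void SplitWithMapping(string srcFile, Section[] sections)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open))
        {
            foreach (Section s in sections)
            {
                using (var view = mmf.CreateViewStream(s.Offset, s.Length,
                                                       MemoryMappedFileAccess.Read))
                using (var output = File.OpenWrite(s.TargetFile))
                {
                    view.CopyTo(output);          // Stream.CopyTo is available from .NET 4.0
                }
            }
        }
    }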

-1
Jun 05 '09 at 14:08


