Processing huge files in C#

I have a 4 GB file in which I want to search for and replace bytes. I wrote a simple program, but it takes far too long (90+ minutes) to do just one search and replace. Several hex editors that I tried can complete the task in under 3 minutes without loading the entire target file into memory. Does anyone know a method by which I can do the same? Here is my current code:

    public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
    {
        var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite);
        int FindPoint = 0;
        int Results = 0;
        for (long i = 0; i < Stream.Length; i++)
        {
            if (Find[FindPoint] == Stream.ReadByte())
            {
                FindPoint++;
                if (FindPoint > Find.Length - 1)
                {
                    Results++;
                    FindPoint = 0;
                    Stream.Seek(-Find.Length, SeekOrigin.Current);
                    Stream.Write(Replace, 0, Replace.Length);
                }
            }
            else
            {
                FindPoint = 0;
            }
        }
        Stream.Close();
        return Results;
    }

Find and Replace are relatively small compared to the 4 GB file. I can easily see why my algorithm is slow, but I'm not sure how to do it better.

+7
5 answers

Part of the problem may be that you are reading the stream one byte at a time. Try reading larger chunks and doing the replacement within them. I would start with 8 KB and then test with larger and smaller buffers to see which gives you the best performance.
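One way to run that experiment is sketched below. It only reads the file and times each pass, so it shows the effect of buffer size in isolation. The class and method names (BufferSizeProbe, ReadAll, Time) are made up for illustration, and the SequentialScan hint is an assumption about the access pattern, not something the question requires.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class BufferSizeProbe
{
    // Reads the whole file front to back with the given buffer size
    // and returns the number of bytes read.
    public static long ReadAll(string path, int bufferSize)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize,
                                           FileOptions.SequentialScan))
        {
            int count;
            // Read may return fewer bytes than requested; 0 means end of file.
            while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
                total += count;
        }
        return total;
    }

    // Times one full read per candidate size; returns elapsed milliseconds.
    public static long[] Time(string path, int[] sizes)
    {
        var elapsed = new long[sizes.Length];
        for (int i = 0; i < sizes.Length; i++)
        {
            var watch = Stopwatch.StartNew();
            ReadAll(path, sizes[i]);
            elapsed[i] = watch.ElapsedMilliseconds;
        }
        return elapsed;
    }
}
```

Calling BufferSizeProbe.Time(path, new[] { 8 * 1024, 64 * 1024, 1024 * 1024 }) gives one timing per size. Keep in mind that the operating system's file cache can make later passes look faster than the first one, so the numbers are only a rough guide.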

+3

There are many better algorithms for finding a substring in a string (which is basically what you are doing).

Start here:

http://en.wikipedia.org/wiki/String_searching_algorithm

The gist of them is that you can skip a lot of bytes by analyzing your substring. Here is a simple example.

The 4 GB file starts with: ABCDEFGHIJKLMNOP

Substring: NOP

  • Skip ahead by the substring length minus 1 and check that byte against the last byte of the substring, so compare C with P
  • It does not match, so the substring is not in the first 3 bytes
  • Also, C does not occur anywhere in the substring, so you can skip 3 more bytes (the length of the substring)
  • Compare F with P: it does not match and F is not in the substring, so skip 3
  • Compare I with P, and so on

If the byte matches, work backwards through the substring. If the byte does not match but does occur in the substring, you can only skip a shorter distance and have to do some more comparing at that point (read the link for details).
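The skipping described above is essentially the Boyer-Moore-Horspool variant of those algorithms. A minimal in-memory sketch follows; the class name Horspool and the method IndexOf are made up for illustration, and for a 4 GB file the search would have to run over chunks rather than over one array.

```csharp
using System;

static class Horspool
{
    // Returns the index of the first occurrence of pattern in data
    // at or after start, or -1 if there is none.
    public static int IndexOf(byte[] data, byte[] pattern, int start = 0)
    {
        if (pattern.Length == 0)
            throw new ArgumentException("pattern must not be empty");

        // shift[b]: how far the window may move when the byte aligned with
        // the last pattern position is b. Bytes that are not in the pattern
        // allow a jump of the full pattern length.
        var shift = new int[256];
        for (int i = 0; i < 256; i++)
            shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = start;
        while (pos <= data.Length - pattern.Length)
        {
            // Compare from the last byte of the pattern backwards.
            int j = pattern.Length - 1;
            while (j >= 0 && data[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos;

            pos += shift[data[pos + pattern.Length - 1]];
        }
        return -1;
    }
}
```

For the example above, searching for NOP in ABCDEFGHIJKLMNOP inspects C, F, I, L and O before it lands on the match at index 13, so most bytes are never looked at.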

+3

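Instead of reading the file byte by byte, read it one buffer at a time:

    byte[] buffer = new byte[bufferSize];
    long currentPos = 0;          // position in the file, not in the buffer
    long length = Stream.Length;  // long, because a 4 GB length overflows int
    int count;
    // The second argument of Read is the offset into buffer, so it stays 0.
    while ((count = Stream.Read(buffer, 0, bufferSize)) > 0)
    {
        currentPos += count;
        ....
    }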
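One detail a buffered loop has to handle is a match that straddles two buffers. The sketch below is one possible way to do that, not the answerer's code: it carries the last Find.Length - 1 bytes over to the front of the next buffer. It assumes Find and Replace have the same length, since an in-place overwrite cannot grow or shrink the file. The names ChunkedReplace and Matches and the 1 MB default buffer are illustrative choices.

```csharp
using System;
using System.IO;

static class ChunkedReplace
{
    public static int ReplaceBytes(string path, byte[] find, byte[] replace,
                                   int bufferSize = 1 << 20)
    {
        // Assumption: equal lengths, so the file can be patched in place.
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must be non-empty and equal in length");
        if (bufferSize < find.Length)
            bufferSize = find.Length;

        int results = 0;
        int overlap = find.Length - 1;
        var buffer = new byte[bufferSize + overlap];

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            long bufferStart = 0; // file offset of buffer[0]
            int carried = 0;      // unscanned bytes kept from the previous chunk
            int count;
            while ((count = stream.Read(buffer, carried, bufferSize)) > 0)
            {
                int valid = carried + count;
                long resume = stream.Position; // where the next Read continues
                int i = 0;
                while (i <= valid - find.Length)
                {
                    if (Matches(buffer, i, find))
                    {
                        stream.Position = bufferStart + i;
                        stream.Write(replace, 0, replace.Length);
                        results++;
                        i += find.Length; // continue after the replaced bytes
                    }
                    else
                    {
                        i++;
                    }
                }
                stream.Position = resume;

                // Move the unscanned tail to the front so that a match
                // spanning two chunks is still found.
                carried = valid - i;
                Array.Copy(buffer, i, buffer, 0, carried);
                bufferStart += i;
            }
        }
        return results;
    }

    // True if pattern occurs in buffer starting at offset.
    static bool Matches(byte[] buffer, int offset, byte[] pattern)
    {
        for (int j = 0; j < pattern.Length; j++)
            if (buffer[offset + j] != pattern[j])
                return false;
        return true;
    }
}
```

The inner comparison is deliberately naive to keep the buffering logic visible. A faster substring search can be dropped in where Matches is called.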
+2

Another, easier way to read more than one byte at a time:

 var Stream = new BufferedStream(new FileStream(File, FileMode.Open, FileAccess.ReadWrite)); 

Combining this with Said Amiri’s example of reading into a buffer and one of the better binary search/replace algorithms should give you better results.

+1

You should try using memory-mapped files, which .NET has supported since version 4.0 (the System.IO.MemoryMappedFiles namespace).

A memory-mapped file contains the contents of a file in virtual memory.

Persisted files are memory-mapped files that are associated with a source file on disk. When the last process has finished working with the file, the data is saved to the source file on disk. These memory-mapped files are suitable for working with extremely large source files.
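A rough sketch of how a persisted memory-mapped file could be used for this task is shown below. It assumes that Find and Replace have the same length, that the file is not empty, and that the process is 64-bit so a single view can cover a 4 GB file. The class name MappedReplace is made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedReplace
{
    public static int ReplaceBytes(string path, byte[] find, byte[] replace)
    {
        // Assumption: equal lengths, so bytes are overwritten in place.
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must be non-empty and equal in length");

        long length = new FileInfo(path).Length;
        int results = 0;

        // Map the existing file; changes made through the view are written
        // back to the file on disk.
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, length))
        {
            long i = 0;
            while (i <= length - find.Length)
            {
                int j = 0;
                while (j < find.Length && view.ReadByte(i + j) == find[j])
                    j++;

                if (j == find.Length)
                {
                    view.WriteArray(i, replace, 0, replace.Length);
                    results++;
                    i += find.Length; // continue after the replaced bytes
                }
                else
                {
                    i++;
                }
            }
        }
        return results;
    }
}
```

The operating system pages the file in and out on demand, so the code never has to manage buffers or chunk boundaries itself. Whether it beats a well-tuned buffered loop is something to measure rather than assume.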

+1
