How to efficiently combine giant files with C#

I have over 125 TSV files of ~100 MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that I end up with one large file containing the contents of all the files, one after another (in no particular order).

Is there an efficient way to do this? I was wondering whether Windows provides an API to simply make a big "union" of all these files. Otherwise, I will have to read all the files and write one big one.

Thanks!

+7
c# filesystems file-io
4 answers

So "merging" really just means writing the files out one after another? That's quite simple: open one output stream, then repeatedly open an input stream, copy the data, and close it. For example:

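    static void ConcatenateFiles(string outputFile, params string[] inputFiles)
    {
        // File.Create truncates an existing file; File.OpenWrite would leave
        // stale bytes at the end if the old file was longer than the new data.
        using (Stream output = File.Create(outputFile))
        {
            foreach (string inputFile in inputFiles)
            {
                using (Stream input = File.OpenRead(inputFile))
                {
                    input.CopyTo(output);
                }
            }
        }
    }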

This uses the Stream.CopyTo method, which was introduced in .NET 4. If you are not using .NET 4 or later, you will need a helper method instead:

    private static void CopyStream(Stream input, Stream output)
    {
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, bytesRead);
        }
    }

I don't know of anything more efficient than this... but, importantly, it won't take up much memory at all. It's not as if it repeatedly reads a whole file into memory and then writes it all out again.

EDIT: As pointed out in the comments, there are ways you can tweak the file options to potentially make this slightly more efficient in terms of what the file system does with the data. But fundamentally you are going to read the data and write it, a buffer at a time, either way.
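As a rough illustration of the kind of tuning the edit refers to, here is a minimal sketch that opens the files through the FileStream constructor with a larger buffer and the FileOptions.SequentialScan hint on the input side. Which options the commenters had in mind is not stated, so this particular choice, the 64 KB buffer size, and the names FileConcatenator and ConcatenateFilesSequential are all assumptions. Measure on your own data before relying on it.

```csharp
using System.IO;

static class FileConcatenator
{
    // Assumed value, not taken from the answer; tune it by measuring.
    private const int BufferSize = 64 * 1024;

    public static void ConcatenateFilesSequential(string outputFile, params string[] inputFiles)
    {
        // FileMode.Create truncates the output file if it already exists.
        using (var output = new FileStream(outputFile, FileMode.Create, FileAccess.Write,
                                           FileShare.None, BufferSize, FileOptions.None))
        {
            foreach (string inputFile in inputFiles)
            {
                // SequentialScan hints to the OS cache that the file is read front to back.
                using (var input = new FileStream(inputFile, FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, BufferSize, FileOptions.SequentialScan))
                {
                    input.CopyTo(output, BufferSize);
                }
            }
        }
    }
}
```

The result is byte-for-byte the same as the simpler version above; only the hints given to the file system differ.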

+17

Do this from the command line:

 copy 1.txt+2.txt+3.txt combined.txt 

or

 copy *.txt combined.txt 
+2

Do you mean a merge where you want to decide, with some custom logic, which lines go where? Or do you mean that you basically just want to concatenate the files into one big file?

In the latter case, you may not need to do this programmatically at all; just generate a batch file containing this (/b is for binary mode, remove it if not needed):

 copy /b "file 1.tsv" + "file 2.tsv" "destination file.tsv" 

Using C#, I would take the following approach. Write a simple function that copies one stream to another:

    // Experiment to find the best buffer size; 65536 often performs very well.
    const int GOOD_BUFFER_SIZE = 65536;

    void CopyStreamToStream(Stream dest, Stream src)
    {
        int bytesRead;
        byte[] buffer = new byte[GOOD_BUFFER_SIZE];

        // Copy everything.
        while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
        {
            dest.Write(buffer, 0, bytesRead);
        }
    }

    // Then use it as follows (do this in a loop, and don't forget to use using-blocks):
    CopyStreamToStream(yourOutputStream, yourInputStream);
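The loop mentioned in the comment above could look roughly like the following sketch. The names TsvMerger and MergeFiles, and the option to delete each source file once it has been copied (the question says the source files may be destroyed), are assumptions for illustration and not part of the original answer.

```csharp
using System.Collections.Generic;
using System.IO;

static class TsvMerger
{
    private const int BufferSize = 65536;

    static void CopyStreamToStream(Stream dest, Stream src)
    {
        byte[] buffer = new byte[BufferSize];
        int bytesRead;
        while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
        {
            dest.Write(buffer, 0, bytesRead);
        }
    }

    // Appends every input file to outputPath, in the order given.
    public static void MergeFiles(string outputPath, IEnumerable<string> inputPaths, bool deleteInputs)
    {
        using (FileStream output = File.Create(outputPath))
        {
            foreach (string inputPath in inputPaths)
            {
                using (Stream input = File.OpenRead(inputPath))
                {
                    CopyStreamToStream(output, input);
                }

                if (deleteInputs)
                {
                    // Ask for the copied bytes to be written through to disk
                    // before the only other copy of them is removed.
                    output.Flush(true);
                    File.Delete(inputPath);
                }
            }
        }
    }
}
```

Deleting as you go keeps the peak extra disk usage to roughly one input file, at the cost of a flush per file; pass false for deleteInputs to keep the originals.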
+2

Why do you want to do this?

One way could be to fiddle with low-level fragmentation; it would be great if you got that to work.

Here is a wrapper for C#.

http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx

0