The fastest way to save thousands of files in VB.NET?

I am downloading thousands of files every second. Each file is about 5 KB, and the total download speed is about 200 Mbit/s. I need to save all of these files to disk.

The download work is spread across thousands of running async tasks. When a task finishes downloading a file and wants to save it, it adds the file to a queue for saving.

Here is the class that handles this. I instantiate it once at startup, and my tasks add the files that need to be saved to its queue.

    Imports System.Collections
    Imports System.IO
    Imports System.Threading.Tasks

    Public Class FileSaver

        Structure FileToSave
            Dim path As String
            Dim data() As Byte
        End Structure

        ' Thread-safe queue; Take() blocks until an item is available.
        Private FileQueue As New Concurrent.BlockingCollection(Of FileToSave)

        Sub New()
            ' Single background consumer that drains the queue.
            Task.Run(
                Async Function()
                    While True
                        Dim fl As FileToSave = FileQueue.Take()
                        Using sourceStream As New FileStream(fl.path, FileMode.Append, FileAccess.Write,
                                                             FileShare.None, bufferSize:=4096, useAsync:=True)
                            Await sourceStream.WriteAsync(fl.data, 0, fl.data.Length)
                        End Using
                    End While
                End Function)
        End Sub

        ' Called by the download tasks to enqueue a finished file.
        Public Sub Add(path As String, data() As Byte)
            Dim fl As FileToSave
            fl.path = path
            fl.data = data
            FileQueue.Add(fl)
        End Sub

        Public Function Count() As Integer
            Return FileQueue.Count
        End Function

    End Class

There is only one instance of this class and only one queue; tasks do not create separate queues. There is a single global instance with an internal queue, and all my tasks add files to it.

I have since replaced my ConcurrentQueue with a BlockingCollection (which wraps a ConcurrentQueue by default), so it should behave the same, but it lets Take() block until an item is available instead of constantly polling.
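For illustration, here is a minimal sketch (not my actual code) of the difference: polling a bare ConcurrentQueue versus blocking on a BlockingCollection, which by default is backed by a ConcurrentQueue:

    Imports System.Collections.Concurrent
    Imports System.Threading

    Module TakeVersusPoll
        Sub Main()
            ' Polling a bare ConcurrentQueue: spin until TryDequeue succeeds.
            Dim queue As New ConcurrentQueue(Of String)
            queue.Enqueue("a.bin")
            Dim item As String = Nothing
            While Not queue.TryDequeue(item)
                Thread.Sleep(1) ' wastes CPU and adds latency
            End While

            ' BlockingCollection wraps a ConcurrentQueue by default;
            ' Take() simply blocks until an item arrives.
            Dim coll As New BlockingCollection(Of String)
            coll.Add("b.bin")
            Dim item2 As String = coll.Take()
        End Sub
    End Module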

The hard drive I use supports a maximum read/write speed of ~180 MB/s. I am only downloading at 200 Mbit/s (~25 MB/s), yet I can't seem to save the data fast enough, because the queue keeps growing. Something is wrong, and I can't figure out what.

Is this the best (fastest) way to do this? Is there anything I can improve here?


EDIT: This question has been put on hold, so I cannot post my own answer with what I figured out. I will post it here instead.

The problem is that while writing to a file is a relatively cheap operation, opening a file for writing is not. Since I was downloading thousands of files, I was saving each one separately, which badly hurt performance.

Instead, I now group multiple files (while they are still in RAM) together into one file (with delimiters) and write that single file to disk. The files I download have properties that allow them to be logically grouped this way and still be usable later. The ratio is about 100:1.
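To illustrate the idea, here is a minimal sketch of such grouping; my actual delimiter format isn't shown here, so this sketch uses length-prefixed records (name, payload length, payload), which serves the same purpose:

    Imports System.Collections.Generic
    Imports System.IO

    Module FileGrouper
        ' Packs many small in-memory files into one container file.
        ' Each record is (name, payload length, payload bytes), so the
        ' individual files can be split apart again when read back.
        Sub WriteGroup(containerPath As String, files As IEnumerable(Of KeyValuePair(Of String, Byte())))
            Using bw As New BinaryWriter(New FileStream(containerPath, FileMode.Create, FileAccess.Write))
                For Each kv In files
                    bw.Write(kv.Key)          ' original file name
                    bw.Write(kv.Value.Length) ' length prefix acts as the delimiter
                    bw.Write(kv.Value)        ' payload
                Next
            End Using
        End Sub
    End Module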

I am no longer bottlenecked by the per-file opens and now save at ~40 MB/s; once I remove another premature limit, I will update this. Hope this helps someone.


EDIT 2: More progress toward my goal of faster I/O.

Since I am now merging several files into one, I perform a single open (CreateFile) operation in total and then several writes to the open file. This is better, but still not optimal. It is better to do one 10 MB write than ten 1 MB writes. Many small writes are slower and cause disk fragmentation, which in turn slows down reads. Not good.

So the solution was to buffer all (or as many as you like) of the downloaded files in RAM, and then, once some threshold is reached, write them all to a single file in one write operation. I have ~50 GB of RAM, so this works fine for me.
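A minimal sketch of this coalescing (the 10 MB threshold and output path are placeholders, and it assumes a single consumer thread, as in my queue above):

    Imports System.IO

    Module WriteCoalescer
        Private ReadOnly buffer As New MemoryStream()
        Private Const FlushThreshold As Integer = 10 * 1024 * 1024 ' placeholder: 10 MB

        ' Accumulate downloaded data in RAM (assumes one consumer thread)...
        Sub Append(data() As Byte)
            buffer.Write(data, 0, data.Length)
            If buffer.Length >= FlushThreshold Then Flush("C:\data\chunk.bin") ' placeholder path
        End Sub

        ' ...then push everything to disk in one large write.
        Sub Flush(path As String)
            Using fs As New FileStream(path, FileMode.Append, FileAccess.Write)
                Dim payload() As Byte = buffer.ToArray()
                fs.Write(payload, 0, payload.Length)
            End Using
            buffer.SetLength(0)
        End Sub
    End Module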

However, now there is another problem. Since I am manually buffering my write data to issue as few write operations as possible, the Windows file cache becomes largely redundant and actually starts to slow things down and eat RAM. Let's get rid of it.

The solution is to perform unbuffered (and asynchronous) I/O, which is supported by Windows' CreateFile() but not easily accessible from .NET. I had to use a library (the only one that seems to exist) to accomplish this, which you can find here: http://programmingaddicted.blogspot.com/2011/05/unbuffered-overlapped-io-in-net.html

This allows simple unbuffered asynchronous I/O from .NET. The only requirement is that you now have to manually align your Byte() buffers to the sector size, otherwise WriteFile() will fail with an "Invalid parameter" error. In my case that just meant padding my buffers to a multiple of 512.
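For example, a helper along these lines pads a buffer with zeros up to the next multiple of the sector size (512 on my drive; note that newer drives may use 4,096-byte sectors, and the real length has to be recorded somewhere so the padding can be stripped when reading back):

    Module BufferAlignment
        ' Pads a buffer with zeros up to the next multiple of the sector size,
        ' since unbuffered WriteFile() rejects lengths that are not sector-aligned.
        Function AlignBuffer(data() As Byte, Optional sectorSize As Integer = 512) As Byte()
            Dim alignedLength As Integer = ((data.Length + sectorSize - 1) \ sectorSize) * sectorSize
            If alignedLength = data.Length Then Return data
            Dim aligned(alignedLength - 1) As Byte ' VB bounds are inclusive: alignedLength bytes
            Array.Copy(data, aligned, data.Length)
            Return aligned
        End Function
    End Module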

After all of this I was able to push write speeds to ~110 MB/s. Much better than I expected.

performance file file-io
1 answer

I suggest you take a look at TPL Dataflow. It sounds like you want to build a producer/consumer.

The beauty of TPL Dataflow over your current implementation is that you can specify the degree of parallelism. This lets you play with the numbers to tune the solution to your needs.
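A minimal sketch of what that could look like with an ActionBlock from the System.Threading.Tasks.Dataflow package (the parallelism and capacity values below are placeholders to tune):

    Imports System.IO
    Imports System.Threading.Tasks.Dataflow

    Module DataflowSaver
        ' MaxDegreeOfParallelism = how many files are written concurrently.
        ' BoundedCapacity gives back-pressure: Post() returns False when the
        ' buffer is full (use SendAsync() to wait instead). Both are placeholders.
        Private ReadOnly options As New ExecutionDataflowBlockOptions With {
            .MaxDegreeOfParallelism = 4,
            .BoundedCapacity = 10000
        }

        ' ActionBlock is a ready-made producer/consumer queue.
        Public ReadOnly SaveBlock As New ActionBlock(Of Tuple(Of String, Byte()))(
            Async Function(fl)
                Using fs As New FileStream(fl.Item1, FileMode.Append, FileAccess.Write,
                                           FileShare.None, bufferSize:=4096, useAsync:=True)
                    Await fs.WriteAsync(fl.Item2, 0, fl.Item2.Length)
                End Using
            End Function, options)

        ' Producers enqueue work like this:
        ' SaveBlock.Post(Tuple.Create(path, data))
    End Module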

As @Graffito mentions, if you are using spinning platters, writing may be limited by the number of files being written simultaneously, which makes tuning for best performance a matter of trial and error.

Of course, you can write your own mechanism to limit concurrency.
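For instance, a SemaphoreSlim gives you a simple hand-rolled limit on concurrent writes (the limit of 4 here is just a placeholder):

    Imports System.IO
    Imports System.Threading
    Imports System.Threading.Tasks

    Module ThrottledWriter
        ' Allows at most N file writes in flight at once.
        Private ReadOnly gate As New SemaphoreSlim(4) ' placeholder: tune per drive

        Async Function SaveAsync(path As String, data() As Byte) As Task
            Await gate.WaitAsync()
            Try
                Using fs As New FileStream(path, FileMode.Append, FileAccess.Write,
                                           FileShare.None, bufferSize:=4096, useAsync:=True)
                    Await fs.WriteAsync(data, 0, data.Length)
                End Using
            Finally
                gate.Release()
            End Try
        End Function
    End Module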

I hope this helps.

[Optional] I worked for a company that archived email with similar disk-writing requirements. The company had problems with I/O speeds when there were too many files in a directory, so they decided to limit it to 1,000 files/folders per directory. That decision predates my time there, but it may be relevant to your project.

