Reading a large number of files quickly

I have a large number (> 100k) of relatively small files (1 KB to 300 KB) that I need to read and process. I am currently looping over all the files, using File.ReadAllText to read the contents of each one, processing the text, and then moving on to the next file. This is fairly slow, and I was wondering if there is a good way to optimize it.

I have already tried using multiple threads, but since this appears to be IO-bound, I have not seen any improvement.

+7
5 answers

Most likely you are right. Reading that many files is likely to limit your potential speedup, since disk I/O will be the limiting factor.

That said, you can most likely get a modest improvement by moving the processing of the data onto a separate thread.

I would recommend trying one "producer" thread that reads your files. This thread will be IO-bound. As it reads each file, it can push the "processing" onto a ThreadPool thread (.NET 4 Tasks work great for this as well), allowing it to start reading the next file immediately.

This will at least take the "processing time" out of the total execution time, making the total time for your job nearly as fast as the disk IO alone, provided you have an extra core or two to work with...
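
Here is a minimal sketch of that pattern, assuming .NET 4; the directory path and the ProcessText method are placeholders for your own:

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    class SingleReaderExample
    {
        // Placeholder for whatever per-file work you actually do.
        static void ProcessText(string path, string contents)
        {
            // ... your processing here ...
        }

        static void Main()
        {
            var tasks = new List<Task>();

            // One thread (this one) does all the disk reads sequentially...
            foreach (string path in Directory.EnumerateFiles(@"C:\data"))
            {
                string contents = File.ReadAllText(path);
                string p = path; // per-iteration copy for the closure (pre-C# 5 foreach)

                // ...and hands the CPU-bound work to the ThreadPool so the
                // next read can start immediately.
                tasks.Add(Task.Factory.StartNew(() => ProcessText(p, contents)));
            }

            Task.WaitAll(tasks.ToArray()); // wait for any outstanding processing
        }
    }

With over 100k files you may also want to cap the number of outstanding tasks (for instance with the bounded queue shown in the next answer) so unprocessed file contents do not pile up in memory.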

+7

What I would do is do the processing in a separate thread. I would read a file, put its data into a queue, then read the next file, and so on.

Then have a second thread read the data from that queue and process it. See if that helps!
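
A sketch of that queue approach using BlockingCollection<T> (available since .NET 4); the directory path and the DoWork method are placeholders:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class QueueExample
    {
        static void DoWork(string contents)
        {
            // ... your processing here ...
        }

        static void Main()
        {
            // Bounded queue: the reader blocks if the processor falls behind,
            // which keeps memory flat even with 100k+ files.
            using (var queue = new BlockingCollection<string>(boundedCapacity: 64))
            {
                // Consumer thread: pulls file contents off the queue and processes them.
                var consumer = Task.Factory.StartNew(() =>
                {
                    foreach (string contents in queue.GetConsumingEnumerable())
                        DoWork(contents);
                });

                // Producer (this thread): reads one file after another.
                foreach (string path in Directory.EnumerateFiles(@"C:\data"))
                    queue.Add(File.ReadAllText(path));

                queue.CompleteAdding(); // signal "no more files"
                consumer.Wait();
            }
        }
    }

CompleteAdding is what lets the consumer's loop end cleanly once the last file has been read and processed.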

+2

Disk seek time is probably the limiting factor here (it is one of the most common bottlenecks when running Make, which typically involves lots of small files). Simple file system designs have a directory entry and then a separate pointer to the disk blocks for each file, which guarantees a minimum of one seek per file.

If you are on Windows, I would switch to NTFS, which stores small files in the directory entry itself (saving one disk seek per file). We also use disk compression (more computation, but CPUs are cheap and fast, and less data on disk means less read time), though this may not be worthwhile if your files are all small. There may be an equivalent Linux file system, if that is where you are.

Yes, you should launch a bunch of threads to read the files:

  forall filename in list: fork( open filename, process file, close filename) 

You may need to throttle this to avoid running out of threads, but I would shoot for hundreds, not 2 or 3. If you do that, you are telling the OS that it can read from lots of places on the disk, and it can order the requests by disk placement (the elevator algorithm), which also helps minimize head movement.
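
One way to get this throttled fan-out in C# is Parallel.ForEach with a capped degree of parallelism; the value 64 below is an arbitrary starting point to tune for your hardware, and the path and Process method are placeholders:

    using System.IO;
    using System.Threading.Tasks;

    class ManyReadersExample
    {
        static void Process(string contents)
        {
            // ... your processing here ...
        }

        static void Main()
        {
            string[] files = Directory.GetFiles(@"C:\data");

            // Many concurrent readers give the OS/driver a batch of requests
            // it can reorder by disk placement (elevator algorithm).
            var options = new ParallelOptions { MaxDegreeOfParallelism = 64 };

            Parallel.ForEach(files, options, path =>
            {
                Process(File.ReadAllText(path));
            });
        }
    }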

0

I would also recommend multithreading to solve this problem. Reading through the answers here, I found Reed Copsey's approach to be the most productive. You can find a sample implementation of that solution, prepared by Elmue, at this link. Hope it is helpful; thanks to Reed Copsey.

0

I agree with Reed's and icemanind's comments. Also, consider how to increase your disk IO: for example, spread the files across multiple disks so that they can be read in parallel, and use faster disks such as an SSD or possibly a RAM disk.
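
A sketch of the multiple-disk idea, assuming the files have already been distributed across the (hypothetical) drives below: group the paths by drive root and give each drive its own reader, so the drives work in parallel while each one still sees a sequential stream of requests.

    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    class MultiDiskExample
    {
        static void Process(string contents)
        {
            // ... your processing here ...
        }

        static void Main()
        {
            // Hypothetical layout: the same data set split across three drives.
            var allFiles = Directory.GetFiles(@"C:\data")
                                    .Concat(Directory.GetFiles(@"D:\data"))
                                    .Concat(Directory.GetFiles(@"E:\data"));

            // One reader per drive root; Parallel.ForEach runs the groups concurrently.
            Parallel.ForEach(allFiles.GroupBy(Path.GetPathRoot), group =>
            {
                foreach (string path in group)
                    Process(File.ReadAllText(path));
            });
        }
    }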

0
