How problematic is it to read many small files from one directory?

I need to read many (up to 5 million) small (9 KB) files. At the moment they are all in one directory. I'm afraid the lookup will take quadratic time, or even n² log n. Is that significant (will the lookup take longer than the actual reading)? And does the asymptotic runtime change when the files are cached by the OS?

I use C++ streams to read the files. Right now I'm on Windows 7 with NTFS, but later the program will run on a Linux cluster (I'm not sure which file system).
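
For concreteness, this is roughly the per-file read I have in mind (a simplified sketch without error handling; read_file is just an illustrative name):

    #include <fstream>
    #include <iterator>
    #include <string>

    // Slurp one small file into a string using C++ streams.
    std::string read_file(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        return std::string((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    }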

1 answer

Perhaps it's not so bad: if you enumerate the files and process each filename as you see it, the OS will most likely still have that directory entry in its disk cache, and for practical purposes the disk cache is O(1).
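
A minimal sketch of that pattern, assuming C++17's <filesystem> is available (the directory name "data" is a placeholder). The point is to do the read inside the enumeration loop, rather than collecting all 5 million names first and looking each one up again later, so the entry the OS just touched is still hot in the cache:

    #include <filesystem>

    int main() {
        // Process each entry as the iterator yields it, in a single pass.
        for (const auto& entry :
             std::filesystem::directory_iterator("data")) {
            if (!entry.is_regular_file())
                continue;
            // Read and process entry.path() right here, e.g. with a
            // slurp like the read_file() sketch in the question.
        }
    }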

What will kill you is a mechanical hard drive: 5 million disk accesses at roughly 1/100 of a second each comes to 50,000 seconds, more than half a day. This is the kind of workload that screams for an SSD.
