Reading efficiently from a single file with multiple threads

I have the following problematic situation. My data set is split into roughly 10k small files (about 8-16 KiB each). Depending on user input, I have to load and process them as quickly as possible. More precisely, each data packet can consist of anywhere from 100 to 100 thousand files, and there are about 1 thousand data packets, although most of them are on the smaller side.

Currently I use a thread pool: for every file access, the next free thread opens the file, reads it, and returns the data prepared for display. Since the number of files will keep growing, I'm not too happy with this approach, especially as it will most likely end up at around 100 thousand files or more (deploying that, of course, will be fun ;)).

So the idea is to combine all these tiny files of one data packet into one large file and read from that. I can guarantee it will be read-only, but I don't know how many threads will access the same file at the same time (I do know the maximum number). This would give me about 1,000 files of decent size, and I can easily add new data packets.
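For illustration, such a packed data-packet file could look something like the sketch below; the layout, field names, and the load_index helper are my own assumptions, not a fixed format. The point is that every logical file then becomes a single positioned read of `length` bytes at `offset`.

```cpp
// Hypothetical on-disk layout for one packed data packet (an assumption for
// illustration only):
//   [uint32_t count][count x { uint64_t offset, uint32_t length }][blob bytes...]
#include <cstdint>
#include <fstream>
#include <vector>

struct PackedEntry {
    std::uint64_t offset;  // absolute position of the blob inside the packed file
    std::uint32_t length;  // blob size in bytes (typically 8-16 KiB here)
};

// Load the index once at startup; after that, every logical "file" is a single
// positioned read of `length` bytes at `offset`.
std::vector<PackedEntry> load_index(const char* path) {
    std::vector<PackedEntry> index;
    std::ifstream in(path, std::ios::binary);
    std::uint32_t count = 0;
    in.read(reinterpret_cast<char*>(&count), sizeof count);
    for (std::uint32_t i = 0; in && i < count; ++i) {
        PackedEntry e{};
        in.read(reinterpret_cast<char*>(&e.offset), sizeof e.offset);
        in.read(reinterpret_cast<char*>(&e.length), sizeof e.length);
        index.push_back(e);
    }
    return index;
}
```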

The question is: how can I allow 1..N threads to read efficiently from a single file in this scenario? I can use asynchronous I/O on Windows, but it is likely to complete synchronously for reads smaller than 64k. Memory-mapping the whole file is not an option, as the expected total size is >1.6 GB and I still need to run on x86 (unless I can efficiently map a small part, read it, and unmap it again; my experience with memory mapping was that it carried quite a bit of overhead compared to a single read).

I thought about opening each data packet N times and handing each thread a handle round-robin, but the problem is that this can end up at (number of data packet files) x (maximum number of threads) open handles (which can easily become 8-16k), and I would either have to synchronize every access to a data packet or resort to some clever trickery to get the next free file handle.

This doesn't seem like a novel problem (I imagine any database engine has a similar structure, where you can have M tables (data packets) with N rows (files in my case), and you want to let as many threads as possible read rows at the same time). So what is the recommended practice here? BTW, it should work on Windows and Linux, so portable approaches are welcome (or at least approaches that work on both platforms, even if they use different underlying APIs - if they can be wrapped, I'm happy).

[EDIT] It's not about raw speed, it's about hiding latency. That is, I read maybe 100 of these tiny files per second, so I'm at 1 MiB/s at best. My main problem is seek times (since my access pattern is not predictable), and I want to hide them by firing off the reads in the background while showing the old data to the user. The question is how to allow multiple threads to issue I/O requests across multiple files, with possibly more than one thread hitting a single file.

It really isn't a problem if one of the calls takes 70 ms or so to complete, but I can't afford reads that block.

+4
5 answers

I don’t think multithreading will help you with disk reading. Assuming the file is on the same drive, you only have one set of read heads to access it, so you are serialized right there.

In this situation, I think I would have one disk-reading thread that sequentially reads the file into buffers (this hopefully maximizes read performance, since the read heads don't have to jump around too much, assuming a reasonably unfragmented data file) and a set of processing threads that consume the buffers and mark them as free once they're done processing.
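A minimal sketch of that layout, assuming a simple buffer queue between one reader thread and a handful of workers (the Buffer type, the 64 KiB chunk size, and the worker count are illustrative choices, not requirements):

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

struct Buffer {
    std::vector<char> data;
};

std::queue<Buffer> ready;          // filled by the reader, drained by workers
std::mutex m;
std::condition_variable cv;
bool done = false;

// Single reader: sequential reads into buffers, handed to the queue.
void reader(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (f) {
        for (;;) {
            Buffer b;
            b.data.resize(64 * 1024);  // sequential 64 KiB chunks
            std::size_t n = std::fread(b.data.data(), 1, b.data.size(), f);
            if (n == 0) break;
            b.data.resize(n);
            {
                std::lock_guard<std::mutex> lk(m);
                ready.push(std::move(b));
            }
            cv.notify_one();
        }
        std::fclose(f);
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();
}

// Workers: take a filled buffer, process it, buffer is freed when it goes out of scope.
void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return done || !ready.empty(); });
        if (ready.empty()) return;     // reader finished and nothing left
        Buffer b = std::move(ready.front());
        ready.pop();
        lk.unlock();
        // ... process b.data here ...
    }
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::thread r(reader, argv[1]);
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker);
    r.join();
    for (auto& w : workers) w.join();
}
```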

However you decide to proceed, may I suggest that you structure your code so that the number of each type of thread is easily configurable, ideally from the executable's command line. In situations like this you will want to experiment with different thread configurations to find the best numbers for your specific setup.

+2

Linux doesn't really have usable asynchronous file I/O (yes, there is aio_*, but it only works with O_DIRECT and has all kinds of weird restrictions), so if you want something portable you pretty much have to use regular read calls. mmap would work, but the cost of changing the mapping may be somewhat higher if you only read a small amount each time.

Now, I don't know about Windows, but on Linux there is pread(), which lets you read from a file descriptor at a given offset without touching the descriptor's file position. With that, you can have any number of threads reading from the same file without having to lock the file descriptor or do anything silly.
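A sketch of what a portable positioned read could look like: pread() on POSIX, and ReadFile() with an explicit offset in the OVERLAPPED structure as the usual counterpart on Windows. The positioned_read wrapper name is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>

// Read `len` bytes at absolute `offset` without relying on a shared file pointer.
static std::int64_t positioned_read(HANDLE h, void* buf, std::size_t len,
                                    std::uint64_t offset) {
    OVERLAPPED ov = {};
    ov.Offset     = static_cast<DWORD>(offset & 0xFFFFFFFFu);
    ov.OffsetHigh = static_cast<DWORD>(offset >> 32);
    DWORD got = 0;
    if (!ReadFile(h, buf, static_cast<DWORD>(len), &got, &ov))
        return -1;
    return got;
}
#else
#include <unistd.h>

// pread() never touches the descriptor's file position, so any number of
// threads can share one descriptor without locking.
static std::int64_t positioned_read(int fd, void* buf, std::size_t len,
                                    std::uint64_t offset) {
    return ::pread(fd, buf, len, static_cast<off_t>(offset));
}
#endif
```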

+1

The fastest way to read a large chunk of data is to create a dedicated disk partition (primary or logical, but not LVM), skip the file system entirely, and read the raw partition device (for example /dev/sda5) sequentially, using only one thread per disk. It is important to access the raw disk sequentially to avoid seeks, which are much slower than sequential reads.
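A minimal sketch of such a single-threaded sequential scan of a raw partition device; /dev/sda5 and the 1 MiB chunk size are examples, and opening the device normally requires root:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = ::open("/dev/sda5", O_RDONLY);   // raw partition, typically needs root
    if (fd < 0) return 1;
    std::vector<char> buf(1 << 20);           // 1 MiB sequential chunks
    ssize_t n;
    while ((n = ::read(fd, buf.data(), buf.size())) > 0) {
        // ... hand the chunk off to processing threads here ...
    }
    ::close(fd);
    return 0;
}
```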

0

The thing that will hurt you is the disk head; no matter how many threads you use, the head can only be in one position at a time. Do you have the option of spreading the file across multiple disks?

0

The mmap approach can fit here. You don't need to do a mmap/unmap cycle for every read; instead, have one thread manage all the mappings and hand out pointers (really an offset and a length) to the others. The actual reading will be scheduled by the OS when a thread touches the virtual memory backing the file.
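A minimal POSIX sketch of that idea, mapping one packet file once, read-only, so any number of reader threads can work on (pointer, length) views into it without locking; on Windows the counterpart would be CreateFileMapping/MapViewOfFile. The names below are illustrative.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

struct MappedPacket {
    const char* base = nullptr;   // start of the read-only mapping
    std::size_t size = 0;         // size of the whole packet file
};

// Map a whole packet file read-only; the mapping can be shared by any number
// of reader threads, and the OS pages data in on first access.
MappedPacket map_packet(const char* path) {
    MappedPacket mp;
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return mp;
    struct stat st;
    if (::fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                         PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            mp.base = static_cast<const char*>(p);
            mp.size = static_cast<std::size_t>(st.st_size);
        }
    }
    ::close(fd);   // the mapping stays valid after closing the descriptor
    return mp;
}

// A reader thread then accesses record i simply as the range
// [mp.base + offset, mp.base + offset + length) for its (offset, length) pair.
```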

Just keep in mind that too many threads will not improve read speed. Database engines typically have a fairly limited number of I / O threads that cater for all the I / O needs of applications.

0
