Why is file I/O in large chunks SLOWER than in small chunks?

If you call ReadFile once with 32 MB as the size, it takes noticeably longer than reading the equivalent number of bytes with a smaller block size, for example 32 KB.

Why?

(No, my disk is not busy.)


Edit 1:

Forgot to mention: I am doing this with FILE_FLAG_NO_BUFFERING!


Edit 2:

Weird ...

I no longer have access to my old machine (PATA), but when I tested there, the large read took about 2x as long, sometimes more. On my new machine (SATA) I only get a 25% difference.

Here is a code snippet to verify with:

    #include <memory.h>
    #include <stdlib.h>
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>

    int main()
    {
        HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                                  OPEN_EXISTING, FILE_FLAG_NO_BUFFERING /*(redundant)*/, NULL);
        __try
        {
            const size_t chunkSize  = 64 * 1024;
            const size_t bufferSize = 32 * 1024 * 1024;
            void *pBuffer = malloc(bufferSize);

            // One large read of the whole range
            DWORD start = GetTickCount();
            ULONGLONG totalRead = 0;
            OVERLAPPED overlapped = { 0 };
            DWORD nr = 0;
            ReadFile(hFile, pBuffer, bufferSize, &nr, &overlapped);
            totalRead += nr;
            _tprintf(_T("Large read: %u for %I64u bytes\n"), GetTickCount() - start, totalRead);

            // The same range again, in chunkSize pieces
            totalRead = 0;
            start = GetTickCount();
            overlapped.Offset = 0;
            for (size_t j = 0; j < bufferSize / chunkSize; j++)
            {
                DWORD nr = 0;
                ReadFile(hFile, pBuffer, chunkSize, &nr, &overlapped);
                totalRead += nr;
                overlapped.Offset += chunkSize;
            }
            _tprintf(_T("Small reads: %u for %I64u bytes\n"), GetTickCount() - start, totalRead);
            fflush(stdout);
            free(pBuffer);
        }
        __finally
        {
            CloseHandle(hFile);
        }
        return 0;
    }

Result:

Large read: 1076 for 67108864 bytes
Small reads: 842 for 67108864 bytes

Any ideas?

+8
windows file-io winapi readfile
6 answers

Your test includes the time it takes to read the file metadata, in particular the mapping of the file's data to locations on disk. If you close the file handle and reopen it, you should get the same timings for each run. I tested this locally to make sure.

The effect is probably more severe with heavy fragmentation, since more of the file-to-disk mapping has to be read.

EDIT: To be clear, I made this change locally and saw nearly identical times for the large and small reads. Reusing the same file handle, I saw timings similar to those in the original question.
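
A minimal sketch of the change this answer describes, applied to the question's snippet: each timed run gets its own freshly opened handle, so neither run reuses state built up by the other. The volume path and sizes are the ones from the question; error handling is omitted.

    // Sketch only: open a fresh handle per timed run, close it afterwards.
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>
    #include <stdlib.h>

    static HANDLE OpenVolume(void)
    {
        return CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
                          FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                          OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    }

    static DWORD TimeReads(void *pBuffer, size_t totalSize, size_t blockSize)
    {
        HANDLE hFile = OpenVolume();                 // fresh handle for this run
        OVERLAPPED overlapped = { 0 };
        DWORD start = GetTickCount();
        for (size_t off = 0; off < totalSize; off += blockSize)
        {
            DWORD nr = 0;
            ReadFile(hFile, pBuffer, (DWORD)blockSize, &nr, &overlapped);
            overlapped.Offset += (DWORD)blockSize;
        }
        DWORD elapsed = GetTickCount() - start;
        CloseHandle(hFile);                          // close so the next run reopens
        return elapsed;
    }

    int main()
    {
        const size_t totalSize = 32 * 1024 * 1024;
        void *pBuffer = malloc(totalSize);
        _tprintf(_T("Large read : %u ms\n"), TimeReads(pBuffer, totalSize, totalSize));
        _tprintf(_T("Small reads: %u ms\n"), TimeReads(pBuffer, totalSize, 64 * 1024));
        free(pBuffer);
        return 0;
    }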

+1

This is not specific to Windows. I ran some tests a while back with the C++ iostream library and found there was an optimal buffer size for reads, above which performance degraded. Unfortunately, I no longer have the tests and can't remember what the size was :-). As for why, well, there are a number of issues, such as a large buffer potentially causing paging in other applications running at the same time (since the buffer cannot be paged out).
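
For what it's worth, a sweep like the one described might look roughly like this. This is only a sketch: "testfile.bin" is a placeholder, and the OS file cache will flatter later passes unless the file is much larger than RAM or the cache is cold for each pass.

    // Sketch: time reading the same file with progressively larger read sizes
    // to look for the knee described above. "testfile.bin" is a placeholder.
    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main()
    {
        for (std::size_t bufSize = 4 * 1024; bufSize <= 32 * 1024 * 1024; bufSize *= 2)
        {
            std::ifstream in("testfile.bin", std::ios::binary);
            std::vector<char> buf(bufSize);

            auto start = std::chrono::steady_clock::now();
            while (in.read(buf.data(), (std::streamsize)buf.size()) || in.gcount() > 0)
                ;  // data is discarded; only the timing matters here
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - start).count();

            std::printf("%9llu-byte reads: %lld ms\n",
                        (unsigned long long)bufSize, (long long)ms);
        }
        return 0;
    }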

+1

When you do the reads as 1024 reads of 32 KB, do you read into the same memory block over and over, or do you allocate a total of 32 MB and fill the entire 32 MB?

If you are doing the smaller reads into the same 32 KB memory block, then the time difference is probably just that Windows does not need to scavenge up the additional memory.


Update based on the addition of FILE_FLAG_NO_BUFFERING to the question:

I am not 100% sure, but I believe that when FILE_FLAG_NO_BUFFERING is used, Windows locks the buffer into physical memory so that it can let the device driver work with physical addresses (e.g., DMA directly into the buffer). It could (I believe) do this by breaking a large request into smaller requests, but I suspect Microsoft may have a philosophy of "if you ask for FILE_FLAG_NO_BUFFERING, then we assume you know what you are doing and we are not going to get in your way".

Of course, locking 32 MB all at once rather than 32 KB at a time requires more resources. So this would be something like my initial guess, but at the physical memory level rather than the virtual memory level.

However, since I do not work for MS and do not have access to the Windows source, I am going by vague memories from the time when I worked closer to the Windows kernel and device driver model (so this is more or less speculation).
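
Whatever the kernel does with the request, unbuffered I/O does have documented alignment requirements: read offsets and sizes must be multiples of the volume sector size, and the buffer should be sector-aligned. For that reason the buffer is often allocated with VirtualAlloc rather than malloc. A sketch, with a placeholder file name:

    // Sketch: allocate a page-aligned buffer for an unbuffered read.
    // Assumptions: "testfile.bin" is a placeholder at least readSize bytes
    // long, and the page size is a multiple of the sector size (true for the
    // common 512-byte and 4096-byte sectors).
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>

    int main()
    {
        const DWORD readSize = 32 * 1024 * 1024;      // multiple of the sector size
        void *pBuffer = VirtualAlloc(NULL, readSize,
                                     MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        HANDLE hFile = CreateFile(_T("testfile.bin"), GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
        {
            _tprintf(_T("CreateFile failed: %u\n"), GetLastError());
            return 1;
        }

        DWORD nr = 0;
        if (!ReadFile(hFile, pBuffer, readSize, &nr, NULL))
            _tprintf(_T("ReadFile failed: %u\n"), GetLastError());
        else
            _tprintf(_T("Read %u bytes\n"), nr);

        CloseHandle(hFile);
        VirtualFree(pBuffer, 0, MEM_RELEASE);
        return 0;
    }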

+1

When you use FILE_FLAG_NO_BUFFERING, the operating system does not buffer the I/O. So each time you call the read function, it issues a system call that fetches data from the disk. To read a file of a given size, a smaller buffer size means more system calls are needed, so there are more user-to-kernel transitions, and a disk I/O is initiated each time. With a larger block size, reading the same amount of data requires fewer system calls, so there are fewer user/kernel switches and fewer disk I/Os are initiated. This is why, in general, a larger block takes less time to read.

Try reading a file 1 byte at a time without buffering, then try it with a 4096-byte block, and see the difference.
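
A sketch of that experiment, with one caveat: with FILE_FLAG_NO_BUFFERING the read size must be a multiple of the sector size, so a 1-byte unbuffered read would simply fail. The sketch below therefore uses an ordinary buffered handle, which still shows the per-call overhead. "testfile.bin" is a placeholder.

    // Sketch: compare the per-call overhead of 1-byte reads vs. 4096-byte
    // reads on an ordinary buffered handle. "testfile.bin" should be at least
    // `total` bytes long; error handling is omitted.
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>

    static DWORD TimeBufferedReads(DWORD blockSize, DWORD total)
    {
        HANDLE hFile = CreateFile(_T("testfile.bin"), GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, 0, NULL);
        static char buffer[4096];
        DWORD start = GetTickCount();
        for (DWORD done = 0; done < total; done += blockSize)
        {
            DWORD nr = 0;
            ReadFile(hFile, buffer, blockSize, &nr, NULL);   // one system call per block
        }
        DWORD elapsed = GetTickCount() - start;
        CloseHandle(hFile);
        return elapsed;
    }

    int main()
    {
        const DWORD total = 4 * 1024 * 1024;   // 4 MB is enough to show the gap
        _tprintf(_T("1-byte reads   : %u ms\n"), TimeBufferedReads(1, total));
        _tprintf(_T("4096-byte reads: %u ms\n"), TimeBufferedReads(4096, total));
        return 0;
    }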

0

A possible explanation, in my opinion, would be command queuing with FILE_FLAG_NO_BUFFERING, since this does direct DMA reads at a low level.

One large request will of course still necessarily be broken into sub-requests, but those will most likely be submitted more or less one after another (since the driver needs to lock the pages and, in all likelihood, will not want to lock several megabytes at once, so as not to hit its quota).

On the other hand, if you hand a dozen or two dozen requests to the driver, it will just forward them to the disk, and the disk gets to use NCQ.

Well, that is what I think may be the reason anyway (it does not explain why the exact same thing happens with buffered reads, as in the question I linked to above).
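
A sketch of what handing the driver a dozen or two requests at once might look like, using overlapped unbuffered reads so that several requests are outstanding at the same time. The volume path matches the question's snippet; error handling is omitted, and the queue depth of 16 is arbitrary; the point is only that more than one request is in flight at a time.

    // Sketch: keep several unbuffered, overlapped reads outstanding at once so
    // the driver and disk have a queue to work with (NCQ).
    #include <windows.h>
    #include <tchar.h>
    #include <stdio.h>

    int main()
    {
        const DWORD chunkSize  = 64 * 1024;
        const int   queueDepth = 16;

        HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
                                  FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

        // Page-aligned buffer: each 64 KB slice is therefore sector-aligned too.
        char *pBuffer = (char *)VirtualAlloc(NULL, (size_t)chunkSize * queueDepth,
                                             MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        OVERLAPPED ov[queueDepth] = {};
        DWORD start = GetTickCount();

        for (int i = 0; i < queueDepth; i++)
        {
            ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            ov[i].Offset = i * chunkSize;
            // Returns FALSE with ERROR_IO_PENDING; the read proceeds asynchronously.
            ReadFile(hFile, pBuffer + i * chunkSize, chunkSize, NULL, &ov[i]);
        }

        ULONGLONG totalRead = 0;
        for (int i = 0; i < queueDepth; i++)
        {
            DWORD nr = 0;
            GetOverlappedResult(hFile, &ov[i], &nr, TRUE);   // wait for this request
            totalRead += nr;
            CloseHandle(ov[i].hEvent);
        }

        _tprintf(_T("%d queued reads: %u ms for %I64u bytes\n"),
                 queueDepth, GetTickCount() - start, totalRead);

        VirtualFree(pBuffer, 0, MEM_RELEASE);
        CloseHandle(hFile);
        return 0;
    }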

0

What you are probably observing is that when using smaller blocks, the second block of data can be read while the first is being processed, then the third read while the second is being processed, and so on, so the speed limit is the slower of the physical read time and the processing time. If it takes the same amount of time to process one block as it does to read the next, the throughput can be double what it would be if reading and processing were done separately. When using larger blocks, the amount of data that is read while the first block is being processed is limited to an amount smaller than the block size; when the code is ready for the next block of data, part of it will have been read, but some of it will not, so the code has to wait while the rest of the data is fetched.
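
As a sketch of the pattern this answer implies, double buffering with overlapped I/O lets block N+1 be read while block N is processed. "testfile.bin" and Process() are placeholders, and error handling is omitted.

    // Sketch: double buffering with overlapped I/O -- start reading the next
    // block into one buffer while the current block is processed from the other.
    #include <windows.h>
    #include <tchar.h>

    static char buffers[2][64 * 1024];

    static void Process(const char *data, DWORD size)
    {
        (void)data; (void)size;   // placeholder for the per-block work
    }

    int main()
    {
        const DWORD blockSize = sizeof buffers[0];

        HANDLE hFile = CreateFile(_T("testfile.bin"), GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

        OVERLAPPED ov = { 0 };
        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

        int cur = 0;
        ReadFile(hFile, buffers[cur], blockSize, NULL, &ov);   // start the first read
        for (;;)
        {
            DWORD nr = 0;
            if (!GetOverlappedResult(hFile, &ov, &nr, TRUE) || nr == 0)
                break;                                         // EOF or error

            // Kick off the read of the next block into the other buffer...
            int next = 1 - cur;
            ov.Offset += nr;
            ResetEvent(ov.hEvent);
            ReadFile(hFile, buffers[next], blockSize, NULL, &ov);

            Process(buffers[cur], nr);                         // ...while this block is processed
            cur = next;
        }

        CloseHandle(ov.hEvent);
        CloseHandle(hFile);
        return 0;
    }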

0
