What pitfalls should I be wary of when memory-mapping BIG files?

I have a bunch of large files. Each file can be more than 100 GB, the total amount of data can be 1 TB, and all of them are read-only (only random reads).

My program does small reads from these files on a computer with 8 GB of main memory.

To improve performance (no seek() and no buffer copying), I thought about using memory mapping, basically memory-mapping the whole 1 TB of data.
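Roughly what I have in mind, as a minimal sketch assuming Linux/POSIX (mmap, madvise) and a made-up file path; the real program would map each large file like this:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/data/bigfile.bin";   /* hypothetical file */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; only the pages actually touched
       will be faulted in from disk. */
    const unsigned char *base = mmap(NULL, st.st_size, PROT_READ,
                                     MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that access is random so the kernel does not read ahead
       aggressively (Linux/BSD madvise). */
    madvise((void *)base, st.st_size, MADV_RANDOM);

    /* A small random read: just dereference the mapping. */
    off_t offset = 123456789;   /* arbitrary example offset */
    if (offset < st.st_size)
        printf("byte at %lld = %u\n",
               (long long)offset, (unsigned)base[offset]);

    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}
```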

Although it sounds crazy at first, since main memory is much smaller than the disk, anyone who understands how virtual memory works should see that there should be no problem on 64-bit machines.

All pages read from disk to satisfy my read()s will be considered "clean" by the OS, since these pages are never overwritten. That means all of those pages can go straight onto the list of pages the OS is free to reuse, without ever being written back to disk or swapped out. In effect, the OS only keeps the LRU pages in physical memory and only issues disk reads when a page is not in main memory.
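If it helps, this is the kind of check I would use to verify that theory. It assumes Linux, where mincore() reports which pages of a mapping are currently resident; the helper name is just for illustration:

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages of a mapping are currently resident in RAM.
   Illustrative sketch: on Linux, mincore() fills one byte per page,
   with bit 0 set when that page is in core. */
static size_t resident_pages(const void *base, size_t length)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (length + page - 1) / page;
    unsigned char *vec = malloc(npages);
    size_t count = 0;

    if (vec && mincore((void *)base, length, vec) == 0) {
        for (size_t i = 0; i < npages; i++)
            if (vec[i] & 1)
                count++;
    }
    free(vec);
    return count;
}
```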

This would mean no swapping and no increase in I/O just because of the huge memory mapping.

That is the theory. What I am looking for is anyone who has tried or used this approach in real production and can share their experience: are there any practical problems with this strategy?

1 answer

What you are describing is correct. With a 64-bit OS you can map 1 TB of address space onto a file and let the OS manage reading and writing the data.

You did not say which processor architecture you are using, but most of them (including amd64) maintain a bit in each page table entry recording whether the page has been written to. The OS can indeed use that flag to avoid writing back pages that have not been modified.

There would be no increase in I/O just because the mapping is large; the amount of data you actually touch is what determines that. Most operating systems, including Linux and Windows, have a unified page cache model in which cached blocks use the same physical pages of memory as memory-mapped pages. I would not expect the OS to use more memory with memory mapping than with cached I/O; you just get direct access to the cached pages.

One concern that may arise is flushing modified data to disk. I am not sure what the policy is on your OS, but the time between when a page is modified and when the OS actually writes that data to disk can be longer than you expect. Use a flush API to force the data to be written to disk if it matters that it is written by a particular time.
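Your mappings are read-only, so this does not apply to you directly, but if you ever end up with a writable mapping, a minimal sketch of forcing a flush with POSIX msync() looks like this (the helper name is just for illustration):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Sketch only: for a writable mapping (PROT_WRITE, MAP_SHARED),
   msync() forces any modified pages in [addr, addr + length) out to
   disk. addr must be page-aligned; MS_SYNC blocks until the write
   has completed. */
int flush_range(void *addr, size_t length)
{
    return msync(addr, length, MS_SYNC);
}
```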

I have not used file mappings quite that large myself, but I would expect it to work well, and it is at least worth a try.

