Proper data structure to use for (this particular) expiring cache?

I need to read from a dataset that is very large and highly interconnected; accesses to it are quite localized, and reading is quite expensive. In particular:

  • Datasets range in size from 2 GB to 30 GB, so I have to memory-map sections of the file for reading. Mapping is very expensive compared to the rest of the work the algorithm does: profiling shows that roughly 60% of the time is spent reading memory, so this is the best place to start optimizing.
  • While I work on one part of the dataset, I have to follow its links (think of it as a linked list), and although those reads are nowhere near guaranteed to be sequential, they are quite localized. Specifically:
  • Say we work on 2 MB of memory at a time. If I read a 2 MB chunk into memory, roughly 40% of the reads I subsequently have to do will fall within that same 2 MB. About 20% will be purely random accesses to the rest of the data, and the remaining 40% will most likely land in the 2 MB segment that linked to this one.

From knowledge of the problem and from profiling, I believe that introducing a cache into the program will help a lot. I want to create a cache that holds N chunks of X megabytes of memory (N and X configurable so I can tune them), which I check first before having to map another section of the file. Furthermore, the longer something has sat in the cache, the less likely it is to be requested again in the short term, so the oldest entries should expire first.
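
To illustrate, here is roughly the shape of what I have in mind (ChunkCache and everything in it are invented names, just to pin the idea down, not working code):

    #include <cstddef>

    // Sketch of the cache I have in mind: N slots, each holding an
    // X-megabyte mapped chunk. All names here are placeholders.
    class ChunkCache {
    public:
        ChunkCache(std::size_t numChunks, std::size_t chunkBytes);

        // Return a pointer into the chunk containing fileOffset,
        // mapping it (and expiring the oldest chunk) on a miss.
        const char* read(std::size_t fileOffset, std::size_t len);
    };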

In the end, my question is simple: what data structure is best for implementing such a cache?

I need a very fast lookup to find out whether a given address is in the cache. On every cache miss, I want to expire the oldest entry and add a new one. However, I plan to tweak it (by changing how much is cached) so that 70% or more of reads are cache hits.

My current thinking is that an AVL tree (O(log n) search/insert/delete) would be the safest choice (no degenerate cases). My other option is a sparse hash table, where lookups would be O(1) at best. In theory that could degenerate to O(n), but in practice I could keep collisions low. The concern there is how long it would take to find and evict the oldest entry in the hash table.

Does anyone have any thoughts or suggestions about which data structure would be best here and why?

+4
3 answers

Sounds like you're looking for an LRU (least recently used) cache: LRU cache design
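
A minimal sketch of such an LRU cache (the std::list plus std::unordered_map layout is the textbook approach; the chunk payload is a placeholder, and real code would map/unmap in the two commented spots):

    #include <cstddef>
    #include <list>
    #include <unordered_map>

    // LRU cache keyed by chunk base address.
    class LruCache {
        using Entry = std::pair<std::size_t, const char*>;
        std::size_t capacity_;
        std::list<Entry> order_;  // front = most recent, back = oldest
        std::unordered_map<std::size_t, std::list<Entry>::iterator> index_;

    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        const char* get(std::size_t base) {
            auto hit = index_.find(base);          // expected O(1) lookup
            if (hit != index_.end()) {
                // Hit: move this entry to the front without copying it.
                order_.splice(order_.begin(), order_, hit->second);
                return hit->second->second;
            }
            if (order_.size() == capacity_) {
                // Miss with a full cache: expire the oldest entry.
                index_.erase(order_.back().first); // ...unmap it here...
                order_.pop_back();
            }
            order_.emplace_front(base, nullptr);   // ...map the new chunk here...
            index_[base] = order_.begin();
            return order_.front().second;
        }
    };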

+3

If 60% of your algorithm's time is I/O, I'd say the exact design of the cache doesn't really matter much: any kind of cache can be a significant speedup.

However, the design does depend on what key you use to look up your chunks: a string, an int, etc. If you have an int key, you can use a hash map into a linked list: pop the back of the list on a cache miss, and on a cache hit erase the entry from its current position and push it to the front.

Hash maps are provided under various names (most often unordered_map) in many implementations: Boost has one, there is one in TR1, etc. The big advantages of a hash map are less performance degradation as the number of elements grows and more flexibility in key types.
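
A tiny usage sketch with the standard spelling; on older toolchains the same container appears as std::tr1::unordered_map or boost::unordered_map with essentially the same interface:

    #include <cstdio>
    #include <unordered_map>

    int main() {
        // Key: chunk base offset; value: pointer to the mapped chunk
        // (nullptr here stands in for a real mapping).
        std::unordered_map<long, const char*> cache;
        cache[0x200000] = nullptr;

        // Expected O(1) lookup, which is what makes this a good fit.
        if (cache.find(0x200000) != cache.end())
            std::puts("hit");
        return 0;
    }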

+2

Keep the cache in two sorted trees (AVL or any other reasonably balanced tree implementation is fine; you're better off using a library than rolling your own).

One tree should be sorted by position in the file. This lets you do an O(log n) lookup to see whether a block is already in your cache.

The other tree should be sorted by access time (which can be represented by a counter that increases by one on each use). When you use a cached block, you delete it from this tree, update its time, and insert it again; that also takes O(log n). On a miss, remove the smallest (oldest) element of this tree and add the new block as the largest. (Remember to remove/add the same block in the by-position-in-file tree as well.)
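
A sketch of that bookkeeping, using std::map as the balanced tree (it is a red-black tree in common implementations); the Block type and all names are invented for the example, and mapping/unmapping is elided:

    #include <cstdint>
    #include <map>

    // Two trees kept in sync: one by file position, one by last-use time.
    struct Block {
        const char*   data;  // would point at the mapped chunk
        std::uint64_t tick;  // last-use time; key into byTime
    };

    std::map<std::uint64_t, Block> byOffset;        // sorted by position in file
    std::map<std::uint64_t, std::uint64_t> byTime;  // tick -> file position
    std::uint64_t counter = 0;

    // Cache hit: delete, update the time, insert again. O(log n).
    void touch(std::uint64_t offset) {
        Block& b = byOffset[offset];
        byTime.erase(b.tick);
        b.tick = ++counter;
        byTime[b.tick] = offset;
    }

    // Cache miss with a full cache: evict the least recently used block.
    void evictOldest() {
        auto oldest = byTime.begin();               // smallest tick = oldest
        byOffset.erase(oldest->second);             // keep both trees in sync
        byTime.erase(oldest);
    }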

If your cache doesn't hold many elements, you may be even better off keeping everything in a sorted array (using insertion sort to add new elements). Moving 16 elements over by one place in an array is incredibly fast.
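
For that variant, a minimal sketch with std::vector (hypothetical names; the shift inside insert is the "moving elements over" cost mentioned above):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    std::vector<std::uint64_t> offsets;  // kept sorted; fine for a small cache

    // O(log n) search for the spot, O(n) shift on insert - cheap when n is tiny.
    void insertOffset(std::uint64_t off) {
        offsets.insert(std::lower_bound(offsets.begin(), offsets.end(), off), off);
    }

    bool contains(std::uint64_t off) {
        return std::binary_search(offsets.begin(), offsets.end(), off);
    }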

+2

Source: https://habr.com/ru/post/1313331/

