First, rather than using locks, use atomic decrement and compare-and-swap operations to manipulate your counter. The syntax for these depends on your compiler; in GCC you could do something like:
    long remaining_cache_slots;

    void release() {
        __sync_add_and_fetch(&remaining_cache_slots, 1);
    }
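The snippet above shows only the release side. A minimal sketch of the matching acquisition side might look like the following (the name try_acquire is illustrative, not from the original):

    // Atomically claim a cache slot; back out if the decrement
    // took the counter below zero (i.e., no slot was free).
    int try_acquire() {
        if (__sync_sub_and_fetch(&remaining_cache_slots, 1) < 0) {
            __sync_add_and_fetch(&remaining_cache_slots, 1); // undo the decrement
            return 0; // cache full; caller should not insert
        }
        return 1; // slot reserved; caller may insert
    }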
This should help narrow the window for contention. However, you will still be bouncing this cache line between cores, which at a high request rate can seriously hurt your performance.
If you are willing to accept a certain amount of waste (i.e., allowing the number of cached results, or rather of pending responses, to fall somewhat below the limit), you have other options. One is to make the cache thread-local (if your design allows it); a minimal sketch follows. Another is to have each thread reserve a pool of cache tokens.
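For the thread-local option, a rough sketch (my own illustration; the type and the limit are assumptions) is simply to give each thread its own cache and budget, so inserts never touch shared state:

    struct cache;                               // your cache type, defined elsewhere
    __thread struct cache *local_cache;         // one private instance per thread
    __thread long local_slots_remaining = 1000; // per-thread limit; tune per workload

The trade-off is that the total footprint becomes the per-thread limit times nthreads, and entries cached by one thread are invisible to the others.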
What I mean by reserving a pool of cache tokens is that each thread can reserve, ahead of time, the right to insert N entries into the cache. When that thread removes an entry from the cache, it adds the freed slot to its set of tokens; if it runs out of tokens, it tries to acquire more from a global pool, and if it accumulates too many, it returns some of them. The code might look something like this:
    long global_cache_token_pool;
    __thread long thread_local_token_pool = 0;

    // Release 10 tokens to the global pool when we go over 20.
    // The maximum waste for this scheme is 20 * nthreads.
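Filling this in, the acquire/release pair might look like the sketch below. The constants and the names acquire_token and release_token are my own, and the batch refill uses the compare-and-swap mentioned at the top of the answer:

    #define LOCAL_HIGH_WATER 20 // spill to the global pool above this
    #define LOCAL_RELEASE_CT 10 // how many tokens to spill at once
    #define LOCAL_ACQUIRE_CT 5  // hypothetical batch size when refilling

    // Reserve the right to insert one cache entry.
    // Returns 1 on success, 0 if no tokens are available anywhere.
    int acquire_token(void) {
        if (thread_local_token_pool > 0) {
            thread_local_token_pool--; // fast path: purely thread-local
            return 1;
        }
        // Slow path: take up to LOCAL_ACQUIRE_CT tokens from the shared
        // pool in one compare-and-swap, keeping one for this insert.
        for (;;) {
            long avail = global_cache_token_pool;
            if (avail <= 0)
                return 0; // global pool exhausted
            long take = avail < LOCAL_ACQUIRE_CT ? avail : LOCAL_ACQUIRE_CT;
            if (__sync_bool_compare_and_swap(&global_cache_token_pool,
                                             avail, avail - take)) {
                thread_local_token_pool += take - 1;
                return 1;
            }
            // Lost a race with another thread; retry.
        }
    }

    // Called when an entry is evicted: reclaim its token locally, and
    // spill a batch back to the global pool once we hoard too many.
    void release_token(void) {
        if (++thread_local_token_pool > LOCAL_HIGH_WATER) {
            thread_local_token_pool -= LOCAL_RELEASE_CT;
            __sync_add_and_fetch(&global_cache_token_pool, LOCAL_RELEASE_CT);
        }
    }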
Batching requests in this way reduces how often threads touch shared data, and therefore the number of contention events and cache misses. However, as mentioned earlier, it makes your limit a little less precise, so it takes careful tuning to strike the right balance.