Clearing the cache in the Python re module

While reading the documentation for the Python re module, I decided to take a look at the source code of re.py.

When I opened it, I discovered the following:

    _cache = {}
    _MAXCACHE = 100

    def _compile(*key):
        cachekey = (type(key[0]),) + key
        p = _cache.get(cachekey)
        if p is not None:
            return p
        # ... some code irrelevant to the question skipped ...
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        _cache[cachekey] = p
        return p

Why is the cache cleared with _cache.clear() when it reaches _MAXCACHE entries?

Is it a common approach to flush the cache completely and start over from scratch?

Why not just delete the value that has been in the cache the longest, instead of throwing everything away?

+8
python regex caching
3 answers

If I had to guess, I would say it was done this way to avoid having to keep track of when, or for how long, individual values have been stored in the cache, which would add both memory and processing overhead. Since the cache object used is a dictionary, which is inherently unordered, there is no good way to know the order in which items were added without some additional bookkeeping. This could be addressed by using an OrderedDict instead of a standard dictionary, assuming you are working with Python >= 2.7, but otherwise you would need to significantly redesign the caching mechanism to eliminate the need for clear().
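
As a rough illustration of that alternative, here is a minimal sketch (my own example, not the actual re module code) of an OrderedDict-based cache that evicts the oldest entry instead of clearing everything; the helper names are made up:

    from collections import OrderedDict

    _cache = OrderedDict()
    _MAXCACHE = 100

    def _cache_lookup(key):
        # Return the cached value, or None if it has not been stored yet.
        return _cache.get(key)

    def _cache_store(key, value):
        if len(_cache) >= _MAXCACHE:
            # Evict the oldest entry (FIFO order) instead of clearing the whole cache.
            _cache.popitem(last=False)
        _cache[key] = value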

+3

Here is a quote on caching from one of the developers of the new regex module planned for 3.3. It is part of a list of features that set the new module apart from the current re module.

7) Modify the compiled expression cache to better handle the thrashing condition. Currently, when regular expressions are compiled, the result is cached so that if the same expression is compiled again, it is retrieved from the cache and no extra work has to be done. This cache supports up to 100 entries. Once the 100th entry is reached, the cache is cleared and a new compile must occur. The danger, albeit rare, is that one may compile the 100th expression only to find that it is recompiled and the same work has to be done again, when it may have been done 3 expressions ago. By modifying this logic slightly, it is possible to establish an arbitrary counter that gives a time stamp to each compiled entry and, instead of clearing the entire cache when it reaches capacity, eliminate only the oldest half of the cache, keeping the half that is more recent. This should limit the possibility of thrashing to cases where a very large number of regular expressions are continually recompiled. In addition to this, I have upped the limit to 256 entries, meaning that the 128 most recent are kept.

http://bugs.python.org/issue2636

This seems to indicate that the current caching behavior is most likely explained by developer laziness, or by an "emphasis on readability."
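
As a sketch of the scheme described in the quote above (my own illustration, not the regex module's actual implementation), each entry could carry a counter-based timestamp, and the oldest half could be dropped when the cache fills up:

    import itertools

    _cache = {}                       # key -> (timestamp, compiled_value)
    _MAXCACHE = 256
    _timestamps = itertools.count()   # monotonically increasing "time stamp"

    def _cache_store(key, value):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest half of the entries, keeping the more recent half.
            oldest = sorted(_cache, key=lambda k: _cache[k][0])[:_MAXCACHE // 2]
            for k in oldest:
                del _cache[k]
        _cache[key] = (next(_timestamps), value)

    def _cache_lookup(key):
        entry = _cache.get(key)
        return entry[1] if entry is not None else None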

+5

The point of the cache is to reduce the average call time of the function. The overhead of keeping more bookkeeping information in _cache, and trimming it rather than clearing it, would increase that average call time. The _cache.clear() call completes quickly, and even though you lose the cache, that is preferable to maintaining extra cache state and paying the cost of removing individual entries from it every time the limit is reached.

When calculating cache efficiency, there are a few things to think about:

  1. Average call time when the cache is hit (very short)
  2. Average call time when the cache is missed (longer)
  3. Frequency of cache misses (fairly rare)
  4. Call time when the cache is cleared or trimmed (also fairly rare)

The question is whether reducing #3 is worth it if doing so means increasing #2 and #4. My guess is that it is not, or that the difference is small enough that keeping the code simple is preferable.
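
To make that trade-off concrete, here is a rough, hypothetical micro-benchmark (not from the original answer) comparing the cost of clearing a 100-entry dictionary with the cost of sorting its keys and deleting the oldest half; exact numbers will vary by machine and Python version:

    import timeit

    setup = "entries = {i: object() for i in range(100)}"

    # Current approach: rebuild a full cache, then throw everything away at once.
    clear_time = timeit.timeit("c = dict(entries); c.clear()", setup=setup, number=10000)

    # Trimming approach: rebuild a full cache, then sort the keys by an assumed
    # insertion order and delete the oldest half.
    trim_stmt = (
        "c = dict(entries)\n"
        "for k in sorted(c)[:50]:\n"
        "    del c[k]\n"
    )
    trim_time = timeit.timeit(trim_stmt, setup=setup, number=10000)

    # The dict copy cost is identical in both cases, so the difference between the
    # two timings reflects clear() versus sort-and-trim.
    print("clear: %.4fs  trim: %.4fs" % (clear_time, trim_time))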

+1
