Smart (?) Database Cache

I have seen several database caching mechanisms, all of which are pretty dumb (i.e. "keep this query cached for X minutes") and require that you manually flush the entire cache repository after an INSERT / UPDATE / DELETE query has been executed.

About 2 or 3 years ago, I developed an alternative database caching system for the project I was working on. The idea was mainly to use regular expressions to find the tables participating in a particular SQL query:

 $query_patterns = array(
     'INSERT'   => '/INTO\s+(\w+)\s+/i',
     'SELECT'   => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:[LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS]\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
     'UPDATE'   => '/UPDATE\s+(\w+)\s+SET/i',
     'DELETE'   => '/FROM\s+((?:[\w]|,\s*)+)/i',
     'REPLACE'  => '/INTO\s+(\w+)\s+/i',
     'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
     'LOAD'     => '/INTO\s+TABLE\s+(\w+)/i',
 );
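For illustration, here is a minimal sketch of how these patterns might be applied to pull the table names out of a query; the function name, the backtick stripping and the comma splitting are my own additions, not part of the original system:

 // Hypothetical helper: detect the statement type from the first keyword,
 // run the matching pattern and return the sorted list of table names.
 function extract_tables($sql, array $query_patterns)
 {
     $sql  = str_replace('`', '', $sql);               // strip backticks so \w can see the names
     $type = strtoupper(strtok(ltrim($sql), " \t\r\n"));
     if (!isset($query_patterns[$type]) || !preg_match($query_patterns[$type], $sql, $m)) {
         return array();                               // unknown statement or no match
     }
     $tables = array();
     foreach (array_slice($m, 1) as $group) {          // groups may hold "table_a, table_b"
         foreach (preg_split('/\s*,\s*/', trim($group), -1, PREG_SPLIT_NO_EMPTY) as $t) {
             $tables[] = $t;
         }
     }
     $tables = array_unique($tables);
     sort($tables);                                    // alphabetical, as used for the folder names
     return $tables;
 }

For the SELECT example further down, this would return array('table_a', 'table_b').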

I know that these regular expressions probably have some flaws (my regular expression skills were pretty green at the time) and they obviously don't match nested queries, but since I never use those, this is not a problem for me.

In any case, after extracting the tables involved, I sorted them alphabetically and created a new folder in the cache repository with the following naming convention:

 +table_a+table_b+table_c+table_...+ 

In the case of a SELECT query, I would fetch the results from the database, serialize() them and save them in the corresponding cache folder. So, for example, the results of the following query:

 SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC; 

would be saved to:

 /cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5 

where the MD5 hash is of the query string itself. On a subsequent identical SELECT query, the results are trivial to retrieve.
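As a rough sketch (not the original code), the SELECT path could look roughly like this, assuming a PDO connection and the hypothetical extract_tables() helper from above:

 // Hypothetical read-through cache for SELECTs: look the query up under its
 // table folder, otherwise run it, serialize() the rows and store them.
 function cached_select(PDO $db, $sql, array $tables, $cache_root = '/cache')
 {
     sort($tables);
     $dir  = $cache_root . '/+' . implode('+', $tables) . '+';
     $file = $dir . '/' . md5($sql) . '.md5';              // the hash is the query string's MD5

     if (is_file($file)) {
         return unserialize(file_get_contents($file));     // cache hit: no database round trip
     }

     $rows = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC); // cache miss: query the database
     if (!is_dir($dir)) {
         mkdir($dir, 0777, true);
     }
     file_put_contents($file, serialize($rows));
     return $rows;
 }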

In the case of any other (write) query (INSERT, REPLACE, UPDATE, DELETE, etc.) I would glob() all folders with +matched_table(s)+ in their name and delete all of their file contents. This way, there is no need to flush the entire cache, only the cache entries used by the affected and related tables.
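A minimal sketch of that invalidation step, again with hypothetical names, might look like this:

 // Hypothetical invalidation: for each table touched by the write query, find
 // every cache folder whose name contains that table and delete its entries.
 function invalidate_tables(array $tables, $cache_root = '/cache')
 {
     foreach ($tables as $table) {
         foreach (glob($cache_root . '/*+' . $table . '+*', GLOB_ONLYDIR) as $dir) {
             foreach (glob($dir . '/*.md5') as $file) {
                 unlink($file);
             }
         }
     }
 }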

The system worked very well, and the performance difference was noticeable - although the project had many more read queries than write queries. Since then I have started using transactions and FK CASCADE UPDATES / DELETES, and I never had time to improve the system to make it work with these features.

I have used the MySQL Query Cache in the past, but I have to say its performance does not even compare.

I wonder: am I the only one who sees beauty in this system? Are there any bottlenecks that I don't know about? Why do popular frameworks such as CodeIgniter and Kohana (I don't know about Zend Framework) have such rudimentary database caching systems?

More importantly, do you see this as a feature worth pursuing? If so, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serializing query results)?

Thanks for reading this whole entry - I appreciate it.

+4
6 answers

I see the beauty in this solution; however, I believe it only works for a very specific set of applications. Scenarios in which it is not applicable include:

  • Databases that use cascading deletes / updates or any kind of triggers. For example, your DELETE on table A might cause a DELETE on table B. The regular expressions will never catch this.

  • Access to the database from points that do not pass through your cache invalidation scheme, e.g. crontab scripts and the like. If you ever decide to implement replication across machines (introduce read-only slaves), that can also leave the cache stale (since such access does not go through cache invalidation either).

Even if these scenarios are unlikely in your case, they still answer the question of why frameworks do not implement this kind of cache.

Regarding whether it is worth doing, it all depends on your application. Maybe you want to provide more information?

+2

The solution, as you have described it, is at risk of concurrency problems. Once you are receiving hundreds of queries per second, you are bound to hit a case where an UPDATE statement executes, but before you can clear the cache, a SELECT reads from it and gets stale data. You may also run into problems when several UPDATEs hit the same set of rows in a short period of time.

In a broader sense, the best practice for caching is to cache the largest objects possible. For example, instead of having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.

Better still is caching entire pages where you can (for example, you show the same homepage to everyone, a profile page looks the same to almost everyone, etc.). One cache fetch for a whole, pre-rendered page far outperforms dozens of cache fetches for row- / query-level caches followed by re-rendering the page.
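As a rough illustration of the whole-page idea (my own sketch, not code from this answer; the cache path, the 10-minute lifetime and render_page() are assumptions), output buffering makes this straightforward in PHP:

 // Serve a pre-rendered page from disk if it is fresh enough, otherwise render
 // it once, store it and serve it.
 $cache_file = '/cache/pages/' . md5($_SERVER['REQUEST_URI']) . '.html';

 if (is_file($cache_file) && time() - filemtime($cache_file) < 600) {
     readfile($cache_file);        // one read, zero queries, zero templating
     exit;
 }

 ob_start();
 render_page();                    // hypothetical: builds the page the normal way
 file_put_contents($cache_file, ob_get_contents());
 ob_end_flush();                   // send the freshly rendered page to the client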

In short: profile. If you take the time to perform some measurements, you will most likely find that caching large objects or even pages, rather than the small queries used to create them, is a huge performance gain.

+2

While I see the beauty in this - especially for environments where resources are limited and cannot easily be expanded, for example on shared hosting - I would personally be afraid of complications down the road: what if somebody newly hired, unaware of the caching mechanism, starts using nested queries? What if some external service starts updating the table without the cache noticing?

For a specialized, specific project that urgently needs a speed-up and cannot be helped by adding CPU power or RAM, this looks like a great solution. As a general-purpose component, I find it too fragile and would be afraid of subtle long-term problems arising from people forgetting that there is a cache they need to be aware of.

+1

The improvement you are describing is to avoid invalidating caches that are guaranteed not to be affected by an update, because they draw their data from different tables.

This is good, of course, but I'm not sure it is fine-grained enough to make a real difference. You will still be discarding a large number of cache entries that didn't really need it (because the update was on the same table, but on different rows).

In addition, even this "simple" scheme relies on being able to discover the relevant tables by looking at the SQL query string. That can be hard to do in general, because of views, table aliases and multiple catalogs/schemas.

It is very hard to automatically (and efficiently) determine whether a cache should be invalidated. Because of that, you either use a very simple scheme (such as invalidating on every update, or per table as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for a specific application with deep hooks into the query logic (probably hard to write and hard to maintain), or accept that the cache may contain stale data and just refresh it periodically.

0

I suspect the regular expressions do not provide for every case - they certainly do not seem to deal with the scenario of mixing database-qualified table names with bare table names. For example, consider

update stats.measures set amount = 50 where id = 1;

and

use stats; update measures set amount = 50 where id = 1;

And then there is PL/SQL.

Then there is the fact that it relies on each client opting into an advisory control mechanism, i.e. it presumes that all access to the database comes from machines implementing the cache control mechanism over a shared file system.

(As a small point, wouldn't it be simpler just to check the modification times of the data files to determine whether the cached version of a query over a given set of tables is still current, rather than trying to establish whether the cache control mechanism has seen the relevant updates - it would certainly be a lot more reliable.)
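A rough illustration of that modification-time check (assuming MyISAM tables and a readable MySQL data directory - both assumptions, and the paths are hypothetical):

 // Hypothetical freshness check: the cached result is only valid if none of the
 // underlying tables' data files have been modified since the cache was written.
 function cache_is_fresh($cache_file, array $tables, $datadir = '/var/lib/mysql/mydb')
 {
     $cached_at = @filemtime($cache_file);
     if ($cached_at === false) {
         return false;                                  // no cache entry yet
     }
     foreach ($tables as $table) {
         $modified = @filemtime($datadir . '/' . $table . '.MYD');
         if ($modified === false || $modified > $cached_at) {
             return false;                              // table changed (or unknown) since caching
         }
     }
     return true;
 }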

Stepping back a little, implementing this from scratch with a robust architecture would mean having all queries intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser, and would certainly require a common synchronisation substrate shared by all its instances. It would probably also need an understanding of the data dictionary - all things that have already been implemented by the database itself.

You state that "I have used the MySQL Query Cache in the past, but I have to say its performance does not even compare."

I find that rather strange. Certainly, when dealing with large result sets, my experience is that loading the data into the heap from the database is much faster than unserializing large arrays - although large result sets are rather atypical for web applications.

When I have needed to speed up database access (after fixing everything else, of course), I went down the path of replicating and partitioning data across multiple DBMS instances.


0

This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a very similar set of regular expressions is used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read against one of those tables comes up, it is routed to the master. If a query reads from a table whose data does not have to be absolutely up to date, it is routed to the slave. Generally, information only really needs to be current when it is something the user changed themselves (i.e. editing their own profile).

They talk about this quite a bit in the O'Reilly High Performance MySQL book. I used it quite a bit when I was developing a system for handling session splitting back in the day.
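A minimal sketch of that routing idea (my own, with hypothetical names; the 5-second staleness window is an assumption) could look like this:

 // Hypothetical master/slave router: writes go to the master and are remembered
 // per table; reads against recently written tables also go to the master.
 class QueryRouter
 {
     private $master;
     private $slave;
     private $written   = array();   // table name => timestamp of the last write
     private $staleness = 5;         // seconds a slave is allowed to lag behind

     public function __construct(PDO $master, PDO $slave)
     {
         $this->master = $master;
         $this->slave  = $slave;
     }

     public function query($sql, array $tables)
     {
         if (!preg_match('/^\s*SELECT\b/i', $sql)) {
             foreach ($tables as $t) {
                 $this->written[$t] = time();         // remember the write
             }
             return $this->master->query($sql);
         }
         foreach ($tables as $t) {
             if (isset($this->written[$t]) && time() - $this->written[$t] < $this->staleness) {
                 return $this->master->query($sql);   // recently written: must be current
             }
         }
         return $this->slave->query($sql);            // otherwise the slave is good enough
     }
 }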

0
