Large MySQL tables

I am working on a problem that requires caching paginated search results over a very large dataset.

The search works as follows: given an item_id, I find the matching item_ids and their rank.

I am willing to cut users off without showing any results past, say, 500. After 500, I am going to assume they are not going to find what they are looking for ... the results are sorted by descending rank anyway. So I want to cache those 500 results, so that I only have to do the heavy lifting of the query once, and users can still page through the results (up to 500).

Now suppose I use a MySQL table as the cache ... that is, I store the top 500 results for each item in a matches table with columns item_id (INTEGER), matched_item_id (INTEGER), match_rank (REAL). Lookups then become extremely fast:

SELECT item.*
FROM matches
JOIN item ON item.id = matches.matched_item_id
WHERE matches.item_id = <item in question>
ORDER BY matches.match_rank DESC
LIMIT x, y;

I have no problem regenerating an item's matches in this table on demand, as clients request them, whenever the cached results are older than, say, 24 hours. The problem is that storing 500 results for each of N items (where N is 100,000 to 1,000,000) makes this table quite large: 50,000,000 to 500,000,000 rows.
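As a sketch of what the cache table might look like, using the column names from the question (the engine choice, unsigned types, and index name are my assumptions, not from the original):

```sql
-- Hypothetical DDL for the matches cache table described above.
CREATE TABLE matches (
    item_id         INT UNSIGNED NOT NULL,
    matched_item_id INT UNSIGNED NOT NULL,
    match_rank      REAL NOT NULL,
    PRIMARY KEY (item_id, matched_item_id),
    -- Lets MySQL find one item's matches and read them in rank
    -- order directly from the index.
    KEY idx_item_rank (item_id, match_rank)
) ENGINE=InnoDB;
```

With the composite key leading on item_id, all 500 cached rows for a given item sit together in the index, so the paginated query touches only a small, contiguous slice of the table.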

Can MySQL handle this? What should I watch out for?

+3
4 answers

MySQL can handle this many rows, and there are several scaling techniques for when you start hitting the wall. Partitioning and replication are the main solutions for this scenario.

You can also review additional scaling techniques for MySQL in a question I asked earlier here on Stack Overflow.
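For the cache table from the question, a partitioning sketch might look like the following (the partition count is an arbitrary example; this assumes MySQL 5.1+ and that item_id is part of the primary key, as partitioning requires):

```sql
-- Hash-partition the cache table by item, so that each lookup
-- for a single item_id touches only one partition.
ALTER TABLE matches
    PARTITION BY HASH (item_id)
    PARTITIONS 16;
```

Since every query in this scheme filters on item_id, the optimizer can prune all but one partition, which keeps index depth and maintenance cost down as the row count grows.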

+4

Agreed with the above. Be very careful not to optimize prematurely by denormalizing here.

Do not use "SELECT *". Extra fields mean more reads from disk.

Make sure you use covering indexes, i.e. indexes from which you can get all the requested field values without going to the data table. Double-check that you are not reading row data.
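A sketch of both points together, using the matches table from the question (the index name is made up):

```sql
-- A covering index: item_id to find the rows, match_rank to sort,
-- matched_item_id as the value we actually want back.
CREATE INDEX idx_matches_cover
    ON matches (item_id, match_rank, matched_item_id);

-- Name only the columns you need instead of SELECT *; this page of
-- 20 results can now be served entirely from the index.
SELECT matched_item_id
FROM matches
WHERE item_id = 42
ORDER BY match_rank DESC
LIMIT 0, 20;
```

Running EXPLAIN on the query should show "Using index" in the Extra column when it is fully covered; if it does not, the query is still reading row data.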

Test, test, test.

If possible, use an append-only table (i.e. no updates or deletes), so MySQL does not reuse deleted space and fragment the indexes.

Make sure the indexed fields are as short as possible (but no shorter).

EDIT: something else came to mind ...

The standard (and fastest) MyISAM table type has no way to maintain records in any order other than insertion order (modified by the reuse of deleted rows), that is, it has no clustered indexes. But you can fake it by periodically copying/reordering the table based on an index that usefully groups related records onto the same page. Of course, the newest records will not be in order, but a table that is 98% ordered beats the default.
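For MyISAM, one way to do the periodic copy/reorder is ALTER TABLE ... ORDER BY, which physically rewrites the rows in the given order (a sketch using the matches table from the question; rows inserted afterwards will again land at the end until the next rewrite):

```sql
-- Physically rewrite the table so that each item's matches are
-- stored together, roughly in rank order.
ALTER TABLE matches ORDER BY item_id, match_rank;

-- Rebuild and defragment the indexes afterwards.
OPTIMIZE TABLE matches;
```

This is worth scheduling during a quiet period, since both statements rewrite the table and lock it for the duration.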

Go through the configuration settings carefully, especially the cache sizes. In fact, to keep it simple, do not worry about any settings other than the cache sizes (and understand what they do).
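A sketch of the handful of cache settings worth understanding first (the sizes below are placeholders to be tuned against your own memory budget, not recommendations; note that the query cache was removed in MySQL 8.0):

```ini
# my.cnf fragment -- illustrative values only.
[mysqld]
key_buffer_size         = 256M   # MyISAM index cache
innodb_buffer_pool_size = 1G     # InnoDB data + index cache
query_cache_size        = 64M    # result cache for identical queries (pre-8.0)
```

The general rule is that the buffer that matters is the one belonging to the storage engine you actually use; the others can stay small.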

Study the server statistics carefully as they relate to the effectiveness of your cache settings.

Run the "slow query log" all the time. This is low overhead, and this is the first stop in all the fixes.

This goes without saying, but do not run anything other than the database on this server. A major reason is so that resources can be tuned for the database alone.

DO NOT denormalize until something falls apart.


The non-negotiables.

Everything above this line is debatable advice. Never take any advice without understanding it and testing it. Every design decision has two sides, and online MySQL advice is worse than average at generalizing without qualification and without quantifying the benefits and penalties at scale. That applies to everything I have written here. Understand what you are doing, why you are doing it, and what benefit you expect to get. Measure the change to see what actually happened.

Never, ever "try something just to see what happens." It is like tuning a car with multiple carburetors, only worse. If what you expected did not happen, back out the change and either figure it out or work on something else that you do understand. Sleep is your friend; most of this will come to you the night after a heavy testing session.

You will never understand it all; there is always more to learn than you know. Always ask "why?" and "what is the evidence?" (Often it is something the author read somewhere that does not apply to your situation.)

+1

MySQL can handle this. The real question is: can it handle it in a reasonable amount of time? That depends on your queries. As Eran Halperin said in his answer, partitioning and replication are the main ways to optimize.

0

As others have said, MySQL scales easily to very large data sets, and will often handle large sets (several million rows) without much developer/DBA intervention beyond some sensible indexing and query optimization. @doofledorer is right about avoiding premature optimization. As the 37signals guys say, if the application is a runaway success and you run into database problems, well, that is a great problem to have.

I would, however, counter the question with one of my own: do you really need to use MySQL as your caching system? There are many places to store a list of 500 integers, and my first choice would be server-side, in the session. Even if session data is written to disk, loading an array of 500 ints is not going to be slow, and there are plenty of strategies for speeding it up with in-memory caches (such as memcached).

Looping through your stored session array and issuing 10 or 20 (or whatever your page size is) individual queries along the lines of "SELECT item.* FROM item WHERE id = X" may sound scary. Yes, it increases the raw number of queries, but each one will be extremely fast, especially once the MySQL query cache kicks in.
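The individual lookups can even be batched into one query per page (the ids below are illustrative stand-ins for a slice of the cached session array):

```sql
-- One page of results: primary-key lookups batched into a single
-- IN list. Note the rows come back in arbitrary order, so the
-- application re-sorts them by their position in the session array.
SELECT item.*
FROM item
WHERE item.id IN (17, 94, 3, 250);
```

Either way, every lookup is a primary-key hit, which is about the cheapest query MySQL can serve.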

Edit: Sam's comment highlighted something I forgot: if you use the session-based approach, you get expiry for free from the stateful nature of sessions. You do not need to worry about clearing out stale data; when the session ends, poof, it is gone. And if you stick with disk-based sessions (I am assuming PHP as the server language here), remember that disk space is incredibly cheap.

In the end it comes down to a trade-off between ease of development and maintenance, scalability, and performance. My point is simply this: just because you are dealing with the results of a database query does not mean that the database is the best place to store those results in every case. Keep an open mind!

0
