Category with a lot of pages (huge offsets) — how does Stack Overflow do it?

I think my question can be answered simply by understanding how, for example, Stack Overflow handles this.

For example, this page loads in just a few hundred milliseconds (< 300 ms): https://stackoverflow.com/questions?page=61440&sort=newest

The only query I can think of for such a page is something like:

    SELECT * FROM stuff ORDER BY date DESC LIMIT {pageNumber}*{stuffPerPage}, {pageNumber}*{stuffPerPage}+{stuffPerPage}

A query like that might take several seconds to run, yet the page loads almost instantly. It can't simply be a cached query either, since new questions are posted all the time and rebuilding the cache on every new question would be insane.

So how do you think this works?

(To simplify the matter, forget about ORDER BY for now.) Example (the table is fully cached in RAM and stored on an SSD):

    mysql> select * from thread limit 1000000, 1;
    1 row in set (1.61 sec)

    mysql> select * from thread limit 10000000, 1;
    1 row in set (16.75 sec)

    mysql> describe select * from thread limit 1000000, 1;
    +----+-------------+--------+------+---------------+------+---------+------+----------+-------+
    | id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows     | Extra |
    +----+-------------+--------+------+---------------+------+---------+------+----------+-------+
    |  1 | SIMPLE      | thread | ALL  | NULL          | NULL | NULL    | NULL | 64801163 |       |
    +----+-------------+--------+------+---------------+------+---------+------+----------+-------+

    mysql> select * from thread ORDER BY thread_date DESC limit 1000000, 1;
    1 row in set (1 min 37.56 sec)

    mysql> SHOW INDEXES FROM thread;
    +--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | Table  | Non_unique | Key_name | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
    +--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    | thread |          0 | PRIMARY  |            1 | newsgroup_id | A         |      102924 |     NULL | NULL   |      | BTREE      |         |               |
    | thread |          0 | PRIMARY  |            2 | thread_id    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
    | thread |          0 | PRIMARY  |            3 | postcount    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
    | thread |          0 | PRIMARY  |            4 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
    | thread |          1 | date     |            1 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
    +--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
    5 rows in set (0.00 sec)
2 answers

Create a BTREE index on the date column and the query will run in a breeze.

    CREATE INDEX date ON stuff(date) USING BTREE

UPDATE: Here is a test I just ran:

    CREATE TABLE test (
        d DATE,
        i INT,
        INDEX(d)
    );

I filled the table with two million rows, each with a distinct i and a more or less random d.
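
(The answer does not show the load script; as a rough sketch, one way to generate comparable data is a recursive CTE, which requires MySQL 8.0+. The date range and distribution below are assumptions.)

    -- Generate ~2,000,000 rows: i is a unique sequence number,
    -- d is a pseudo-random date somewhere in a ~200-year range.
    SET SESSION cte_max_recursion_depth = 2000000;

    INSERT INTO test (d, i)
    WITH RECURSIVE seq (n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 2000000
    )
    SELECT DATE_ADD('1800-01-01', INTERVAL FLOOR(RAND() * 73000) DAY), n
    FROM seq;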

    mysql> SELECT * FROM test LIMIT 1000000, 1;
    +------------+---------+
    | d          | i       |
    +------------+---------+
    | 1897-07-22 | 1000000 |
    +------------+---------+
    1 row in set (0.66 sec)

    mysql> SELECT * FROM test ORDER BY d LIMIT 1000000, 1;
    +------------+--------+
    | d          | i      |
    +------------+--------+
    | 1897-07-22 | 999980 |
    +------------+--------+
    1 row in set (1.68 sec)

And here is an interesting observation:

    mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 1000, 1;
    +----+-------------+-------+-------+---------------+------+---------+------+------+-------+
    | id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra |
    +----+-------------+-------+-------+---------------+------+---------+------+------+-------+
    |  1 | SIMPLE      | test  | index | NULL          | d    | 4       | NULL | 1001 |       |
    +----+-------------+-------+-------+---------------+------+---------+------+------+-------+

    mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 10000, 1;
    +----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
    +----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
    |  1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 2000343 | Using filesort |
    +----+-------------+-------+------+---------------+------+---------+------+---------+----------------+

MySQL uses the index for OFFSET 1000, but not for 10000.

Even more interesting: if I add FORCE INDEX, the query takes longer:

    mysql> SELECT * FROM test FORCE INDEX(d) ORDER BY d LIMIT 1000000, 1;
    +------------+--------+
    | d          | i      |
    +------------+--------+
    | 1897-07-22 | 999980 |
    +------------+--------+
    1 row in set (2.21 sec)

I don't think Stack Overflow ever needs to reach rows at an offset of 10,000,000. The query below should be fast enough, provided you have an index on date and the numbers in the LIMIT clause come from real page numbers rather than from the millions :)

    SELECT * FROM stuff ORDER BY date DESC LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}
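
For instance, treating the page number as 0-based and assuming a hypothetical 30 items per page, page 2 would be:

    -- Skip 2*30 = 60 rows and return the next 30; with an index on date
    -- this stays fast as long as the offset remains modest.
    SELECT * FROM stuff ORDER BY date DESC LIMIT 60, 30;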

UPDATE:

If records are deleted from the table relatively rarely (as on Stack Overflow), you can use the following approach:

    SELECT *
    FROM stuff
    WHERE id BETWEEN {stuffCount} - {pageNumber}*{stuffPerPage} + 1
                 AND {stuffCount} - {pageNumber-1}*{stuffPerPage}
    ORDER BY id DESC

Where {stuffCount} is obtained from:

    SELECT MAX(id) FROM stuff
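
To make the substitution concrete, here is a worked example with purely hypothetical numbers ({stuffCount} = 1000000, {stuffPerPage} = 30, page 2, counting pages from 1):

    -- id BETWEEN 1000000 - 2*30 + 1 AND 1000000 - (2-1)*30,
    -- i.e. the 30 rows with ids 999941 .. 999970 (the 31st to 60th newest).
    SELECT *
    FROM stuff
    WHERE id BETWEEN 999941 AND 999970
    ORDER BY id DESC;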

If some records have been deleted from the table, some pages will contain fewer than {stuffPerPage} entries, but that shouldn't be a problem. Stack Overflow itself uses a similarly imprecise algorithm: try going to the first page and then to the last page, and you will see that both return 30 records per page — which, mathematically, should only happen when the total number of questions is an exact multiple of 30.

Solutions designed to work with large databases often rely on hacks like this, which ordinary users usually never notice.


Paging through millions of records is rarely done anymore because it is impractical. The popular approach now is infinite scrolling (automatic, or manual with a "load more" button). It makes more sense, and pages load faster because the whole page does not have to be reloaded. If you think old posts might still be useful to your users, a good option is a page that shows random posts (again with infinite scrolling). That's just my opinion :)
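
The "load more" step behind such infinite scrolling is typically a keyset (seek) query rather than a large OFFSET. Here is a sketch, reusing the stuff table from the question; {lastSeenId} is a hypothetical placeholder for the id of the last row already shown:

    -- Seek past the last row already on screen; the cost does not grow with
    -- how far the user has scrolled, unlike an OFFSET-based query.
    SELECT *
    FROM stuff
    WHERE id < {lastSeenId}
    ORDER BY id DESC
    LIMIT {stuffPerPage};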

