Optimizing content popularity queries built on hit logging


I've been searching around but haven't come up with anything, so maybe someone can point me in the right direction.
I have a website with a lot of content stored in a MySQL database, and a PHP script that loads the most popular content by hits. It does this by logging every hit on a piece of content into a table, along with the access time. A SELECT query is then run to find the most popular content over the last 24 hours, 7 days, or at most 30 days. A cron job deletes anything older than 30 days from the log table.

The problem I'm facing now is that the log table has grown to 1M+ hit entries, and this really slows down my SELECT query (10-20 s). At first I thought the problem was the JOIN I use in the query to fetch the content title, URL, etc., but from testing I'm not so sure: removing the JOIN doesn't speed the query up as much as I'd like.

So my question is: what is the best practice for maintaining and querying this kind of popularity data? Are there good open-source scripts for it? Or what would you suggest?

Table

"popularity" hit table
nid | insert_time | tid
nid: Node Content ID
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term / category identifier

"node" content table
nid | title | status | (there are more columns, but these are the important ones)
nid: Node ID
title: content title
status: content is published (0 = false, 1 = true)
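For reference, here is a minimal sketch of the two tables as described; the exact column types are assumptions, only the column names and their meaning come from the description above.

  CREATE TABLE popularity (
    nid         INT NOT NULL,        -- node/content id
    insert_time DATETIME NOT NULL,   -- e.g. 2011-06-02 04:08:45
    tid         INT NOT NULL         -- term/category id
  );

  CREATE TABLE node (
    nid    INT NOT NULL PRIMARY KEY,
    title  VARCHAR(255) NOT NULL,
    status TINYINT NOT NULL          -- 0 = unpublished, 1 = published
    -- ...more columns omitted
  );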

SQL

 SELECT node.nid, node.title, COUNT(popularity.nid) AS count
 FROM `node`
 INNER JOIN `popularity` USING (nid)
 WHERE node.status = 1
   AND popularity.insert_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
 GROUP BY popularity.nid
 ORDER BY count DESC
 LIMIT 10;
+4
5 answers

We faced a similar situation, and this is how we got around it. We decided we didn't really care about the exact time something happened, only the day it happened. Then we did the following:

  • Each record has a "total hits" counter, which is incremented every time a hit happens.
  • A log table records these "total hits" once per day (from a cron job).
  • By selecting the difference between two given dates in this log table, we can very quickly work out the "hits" between those two dates (see the sketch below).
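Here is a minimal SQL sketch of that idea, assuming a total_hits counter column on the content table and a separate snapshot table; all of these names are illustrative, not prescribed by the answer.

  -- Daily snapshot of the running counters (written by cron)
  CREATE TABLE node_hits_daily (
    nid           INT  NOT NULL,
    snapshot_date DATE NOT NULL,
    total_hits    INT  NOT NULL,
    PRIMARY KEY (nid, snapshot_date)
  );

  -- Cron job, once per day: copy the current counters
  INSERT INTO node_hits_daily (nid, snapshot_date, total_hits)
  SELECT nid, CURDATE(), total_hits FROM node;

  -- Hits in the last 7 days = difference between two snapshots
  SELECT today.nid, today.total_hits - week_ago.total_hits AS hits_7d
  FROM node_hits_daily AS today
  JOIN node_hits_daily AS week_ago
    ON week_ago.nid = today.nid
   AND week_ago.snapshot_date = CURDATE() - INTERVAL 7 DAY
  WHERE today.snapshot_date = CURDATE()
  ORDER BY hits_7d DESC
  LIMIT 10;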

The advantage of this is that the size of your log table never exceeds NumRecords * NumDays, which in our case is very small. Also, any queries against this log table are very fast.

The disadvantage is that you lose the ability to break hits down by time of day, but if you don't need that, it may be worth considering.

+2

You actually have two problems here, and both will need attention further down the road.

One of them, which you haven't run into yet but may run into sooner than you'd like, is insert throughput into your statistics table.

The other, which you outlined in your question, is actually using those statistics.


Let's start with insert throughput.

First of all, if you're going down this road, don't track statistics on pages that could be cached. Use a PHP script that advertises itself as an empty JavaScript file, or as a one-pixel image, and include it in the pages you're tracking. Doing so allows the rest of your site's content to be cached without issue.

In the telecom business, rather than doing actual inserts for the billing of phone calls, things are kept in memory and synced to disk periodically. Doing so allows them to handle gigantic throughput while keeping the hard drives happy.

To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for the first part...

For each page, you need a memcache variable. In memcache, increment() is atomic, but add(), set(), and so on are not. So you need to be careful not to miss counts when concurrent processes add the same page at the same time:

 $ns = $memcache->get('stats-namespace');
 while (!$memcache->increment("stats-$ns-$page_id")) {
     $memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
     $db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
 }

Periodically, say every 5 minutes (adjust the timeout above accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or the existing counts. For this, you increment the namespace before doing anything (this gives you a lock on the existing data for all intents and purposes), and sleep a little so that existing processes which reference the previous namespace can finish up if needed:

 $ns = $memcache->get('stats-namespace');
 $memcache->increment('stats-namespace');
 sleep(60); // allow concurrent page loads to finish

Once this is done, you can safely loop through your page ids, update the statistics accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields: page_id int pkey, ns_id int. There's slightly more to it than the simple select, insert, update and delete statements you'd run from your scripts, however, so continuing...
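A possible definition of that refresh-queue table, following the field list above and the "engine = memory" hint in the pseudo-code (everything beyond those two hints is an assumption):

  CREATE TABLE needs_stats_refresh (
    page_id INT NOT NULL PRIMARY KEY,
    ns_id   INT NOT NULL
  ) ENGINE = MEMORY;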

As another answerer mentioned, it's quite appropriate to maintain intermediate statistics for your purposes: store batches of hits rather than individual hits. At the very most, I'm assuming you want hourly or quarter-hourly statistics, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.

Even more importantly for you, since you're ordering posts by these totals, you want to store aggregated totals and have an index on the latter. (We'll get to why further down.)

One way to maintain the totals is to add a trigger that, on insert into or update of the statistics table, adjusts the totals as needed.
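A minimal sketch of such a trigger, assuming a stats table holding the batched subtotals and a stat_totals table keyed by page_id; the table and column names here are illustrative, not prescribed by the answer:

  CREATE TRIGGER stats_after_insert
  AFTER INSERT ON stats
  FOR EACH ROW
    INSERT INTO stat_totals (page_id, total)
    VALUES (NEW.page_id, NEW.hits)
    ON DUPLICATE KEY UPDATE total = total + NEW.hits;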

Be especially careful about deadlocks. While no two $ns runs will be mixing their respective statistics, there is still a (remote) chance that two or more processes kick off the "increment $ns" step described above and subsequently issue statements that try to update the totals concurrently. Obtaining an advisory lock is the simplest, safest and fastest way to avoid problems related to this.

Assuming you do use an advisory lock, it's perfectly fine to use total = total + subtotal in the update statement.
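In MySQL, advisory locks are the named locks taken with GET_LOCK()/RELEASE_LOCK(). A sketch of how the sync step could wrap its updates; the lock name and the literal values are placeholders:

  SELECT GET_LOCK('stat_totals_sync', 10);   -- wait up to 10 s for the lock

  UPDATE stat_totals
  SET total = total + 42                     -- subtotal computed for this batch
  WHERE page_id = 123;

  SELECT RELEASE_LOCK('stat_totals_sync');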

While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them all processed in one go, because that could mean holding an exclusive lock for an extended duration. The simplest approach here is to process the inserts into the statistics table in smaller batches (say, 1000), each followed by a commit.

For intermediary statistics (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your statistics table. Have each of these store whether the row has already been counted towards the monthly, weekly, daily statistics, etc. Place a trigger on them as well, so that they increase or decrease the applicable totals in your stat_totals table.

As a closing note, give some thought to where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table rather than in the pages table, to avoid cluttering your pages table with (much larger) dead rows.


Assuming you've done all of the above, your final query becomes:

 select p.*
 from pages p
 join stat_totals s using (page_id)
 order by s.weekly_total desc
 limit 10;

It should be plenty fast with an index on weekly_total.
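That index would be something along these lines, assuming the stat_totals table sketched earlier (the names are illustrative):

  ALTER TABLE stat_totals ADD INDEX idx_weekly_total (weekly_total);
  -- MySQL can scan this index in reverse to satisfy ORDER BY weekly_total DESC LIMIT 10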

Lastly, let's not forget the most obvious point: if you're running these same total/monthly/weekly/etc. queries over and over, their results should be placed in memcache too.

+1

You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.

You should really only need to calculate the last 7/30 days of traffic once daily,

and you could do the last 24 hours hourly, couldn't you?

Even if you did it every 5 minutes, that's still a huge saving compared to running the (expensive) query for every hit from every user.
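A minimal sketch of that with plain MySQL and a cron job, reusing the table names from the question; the cache table itself is hypothetical:

  -- One cache table per period, refreshed from cron
  CREATE TABLE popular_cache_7d (
    nid  INT NOT NULL PRIMARY KEY,
    hits INT NOT NULL
  );

  -- Cron job (daily, or every 5 minutes): rebuild the cache
  TRUNCATE popular_cache_7d;
  INSERT INTO popular_cache_7d (nid, hits)
  SELECT p.nid, COUNT(*)
  FROM popularity p
  JOIN node n USING (nid)
  WHERE n.status = 1
    AND p.insert_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
  GROUP BY p.nid;

  -- Page loads then only run this cheap query
  SELECT n.nid, n.title, c.hits
  FROM popular_cache_7d c
  JOIN node n USING (nid)
  ORDER BY c.hits DESC
  LIMIT 10;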

0

RRDtool

Many tools/systems do not build their own logging and log aggregation, but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.

From your question I assume you don't need any special or fancy analysis, and RRDtool would efficiently do what you need without you having to implement and tune your own system.

0

You can do some "aggregation" in the background, for example via a cron job. Some suggestions (in no particular order) that might help:

1. Create a table with hourly results. This means you can still produce the statistics you want, but you reduce the amount of data to (24 * 7 * 4 =) about 672 records per page per month.

Your table could look something like this:

  hourly_results (
 nid integer,
 start_time datetime,
 amount integer
 )

Once you have rolled the raw hits up into your overall table, you can more or less delete them (see the sketch below).
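A minimal sketch of that cron statement, assuming the hourly_results table above and the popularity log table from the question:

  -- Fold raw hits from completed hours into hourly buckets
  INSERT INTO hourly_results (nid, start_time, amount)
  SELECT nid,
         DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00') AS start_time,
         COUNT(*) AS amount
  FROM popularity
  WHERE insert_time < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
  GROUP BY nid, DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00');

  -- ...then the raw rows that were just folded can be deleted
  -- (ideally both statements run inside one transaction)
  DELETE FROM popularity
  WHERE insert_time < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00');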

2. Use result caching (memcache, APC). You can easily store the results (which shouldn't change every minute, but rather every hour?) in memcache (which you can again update from a cron job), or use file caching by serializing objects/results if you're short on memory.

3. Optimize your database. 10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?
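If indexes turn out to be the issue, a couple of things worth checking against the query from the question; these particular indexes are suggestions, not a diagnosis:

  -- A composite index so the date filter and join can be resolved from the index
  ALTER TABLE popularity ADD INDEX idx_time_nid (insert_time, nid);
  ALTER TABLE node       ADD INDEX idx_status (status);

  -- EXPLAIN shows whether the query actually uses them
  EXPLAIN SELECT node.nid, node.title, COUNT(popularity.nid) AS count
  FROM node
  INNER JOIN popularity USING (nid)
  WHERE node.status = 1
    AND popularity.insert_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
  GROUP BY popularity.nid
  ORDER BY count DESC
  LIMIT 10;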

0
