How do you scale the voting system on a high-traffic website?

View

You open the comment page, and the voting controls on each comment are highlighted to reflect the votes you have already cast.

[Image: vote up reddit]

Database

To support this requirement, the database schema needs at a minimum:

Page

  • int pageId

Comments

  • int commentId
  • int pageId

Votes

  • int userId
  • int commentId
  • enum direction (up, down)

Controller

If the page ID is 123 and the user ID is 456, a naive controller implementation would be:

1) Request all votes made by user 456 on the comments of page 123:

SELECT c.commentId, v.direction
FROM comments AS c
JOIN votes AS v ON v.commentId = c.commentId
WHERE c.pageId = 123 AND v.userId = 456;

2) Build a view with the results of this query.
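As a rough illustration of the naive controller above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `pages`, the composite primary key on `votes`, and the helper names `create_schema` and `votes_for_page` are assumptions made for the example; only the columns and the query come from the question.

```python
import sqlite3


def create_schema(conn):
    # Minimal schema from the question; constraints are assumptions.
    conn.executescript("""
        CREATE TABLE pages (pageId INTEGER PRIMARY KEY);
        CREATE TABLE comments (
            commentId INTEGER PRIMARY KEY,
            pageId INTEGER NOT NULL REFERENCES pages(pageId)
        );
        CREATE TABLE votes (
            userId INTEGER NOT NULL,
            commentId INTEGER NOT NULL REFERENCES comments(commentId),
            direction TEXT NOT NULL CHECK (direction IN ('up', 'down')),
            PRIMARY KEY (userId, commentId)
        );
    """)


def votes_for_page(conn, page_id, user_id):
    # Step 1: one query per page view, per user (the expensive part).
    rows = conn.execute(
        "SELECT c.commentId, v.direction "
        "FROM comments AS c JOIN votes AS v ON v.commentId = c.commentId "
        "WHERE c.pageId = ? AND v.userId = ?",
        (page_id, user_id),
    )
    # Step 2: the view only needs commentId -> direction.
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO pages VALUES (123)")
conn.executemany("INSERT INTO comments VALUES (?, ?)",
                 [(1, 123), (2, 123), (3, 123)])
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)",
                 [(456, 1, "up"), (456, 3, "down"), (789, 2, "up")])

page_votes = votes_for_page(conn, 123, 456)
print(page_votes)
```

The view would then highlight the up arrow on comment 1 and the down arrow on comment 3 for user 456, and leave comment 2 unhighlighted.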

Scalability issue

Querying the database to support this voting system is very expensive. The comments and votes tables will be huge. On a high-traffic site, thousands of users will run this query every second to get their personalized view of the votes on each comment. How do you scale this voting system so that the database is not overloaded with queries? Would you cache the results in memory? Isn't it better practice to cache things that are common to a large audience? In this case, the queries are specific to individual users. On a website with millions of users, the cache memory will fill up quickly, cache misses will occur, and the database will be overwhelmed.

1 answer

I think Reddit caches / stores, for each comment, the list of users who upvoted it (and a second list for downvotes), and only updates this cache every X seconds / minutes / hours depending on activity. Each list would be kept sorted so that it can be binary-searched.

Then, when building the page, the server only needs to check whether the current user's ID is in the upvote / downvote list of each comment. Reddit also limits the number of initially visible comments, which reduces the number of lookups required.
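The per-comment lookup described here can be sketched in a few lines of Python with the standard bisect module. This is only an illustration of the idea, not Reddit's actual code: the cache layout (a dict mapping each comment ID to a pair of sorted user-ID lists) and the names `contains`, `vote_state` and `vote_cache` are assumptions.

```python
from bisect import bisect_left


def contains(sorted_ids, user_id):
    # Binary search: O(log n) membership test on a sorted sequence.
    i = bisect_left(sorted_ids, user_id)
    return i < len(sorted_ids) and sorted_ids[i] == user_id


def vote_state(cache, comment_id, user_id):
    # cache maps comment_id -> (sorted upvoter IDs, sorted downvoter IDs).
    up_ids, down_ids = cache.get(comment_id, ((), ()))
    if contains(up_ids, user_id):
        return "up"
    if contains(down_ids, user_id):
        return "down"
    return None


# Hypothetical cached lists for three visible comments.
vote_cache = {
    1: ([101, 456, 900], [12, 333]),
    2: ([7, 88], [456]),
    3: ([5], []),
}

states = {cid: vote_state(vote_cache, cid, 456) for cid in (1, 2, 3)}
print(states)
```

The cached lists are shared by every visitor, so the cache size grows with the number of votes rather than with the number of (user, page) combinations, which is the concern raised in the question.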

Reddit also does not apply votes immediately (it adds each vote to a queue). It could tie the queue processing to the vote cache updates.

I assume that Reddit must also keep track of each user's most recent votes so that it can fill the gap between the last cache update and now.
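To make the queue and the "recent votes" idea concrete, here is a small in-memory sketch in Python. It follows the speculation above rather than any documented Reddit design; the class `VoteStore` and its methods `vote`, `flush` and `state` are hypothetical names, and a real system would use a message queue and a shared cache instead of local objects.

```python
from bisect import bisect_left, insort
from collections import deque


class VoteStore:
    def __init__(self):
        self.cache = {}       # comment_id -> {"up": sorted IDs, "down": sorted IDs}
        self.queue = deque()  # pending (user_id, comment_id, direction)
        self.recent = {}      # user_id -> {comment_id: direction} since last flush

    def vote(self, user_id, comment_id, direction):
        # Votes are queued, not applied; the per-user overlay hides the delay.
        self.queue.append((user_id, comment_id, direction))
        self.recent.setdefault(user_id, {})[comment_id] = direction

    def flush(self):
        # Periodic job: apply queued votes to the shared sorted lists.
        while self.queue:
            user_id, comment_id, direction = self.queue.popleft()
            lists = self.cache.setdefault(comment_id, {"up": [], "down": []})
            for ids in lists.values():
                i = bisect_left(ids, user_id)
                if i < len(ids) and ids[i] == user_id:
                    ids.pop(i)  # drop any earlier vote by this user
            insort(lists[direction], user_id)
        self.recent.clear()

    def state(self, user_id, comment_id):
        # Recent votes win over the possibly stale cache.
        recent = self.recent.get(user_id, {})
        if comment_id in recent:
            return recent[comment_id]
        lists = self.cache.get(comment_id, {"up": [], "down": []})
        for direction, ids in lists.items():
            i = bisect_left(ids, user_id)
            if i < len(ids) and ids[i] == user_id:
                return direction
        return None


store = VoteStore()
store.vote(456, 1, "up")
before_flush = store.state(456, 1)   # served from the recent-votes overlay
store.flush()
after_flush = store.state(456, 1)    # served from the shared cache
store.vote(456, 1, "down")
changed = store.state(456, 1)        # overlay overrides the stale cache entry
store.flush()
final = store.state(456, 1)
```

The user sees their own vote right away, while the shared lists are rewritten only once per flush interval.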

This may not be 100% accurate.
It is based on limited reading about Reddit's architecture and on what I would do myself.
