MongoDB - Tweet Consumption and Data Counting

I use Twitter's real-time streaming API to maintain a running count for several tracked keywords. For example, I want to count the number of times "apple", "orange" and "pear" are mentioned on Twitter. I use Mongo to store the tweet data, but I have a question about the best way to do the counting for each of the tracks I follow.

I will run this query once per second to get a near real-time count for each track, so I need to make sure I am doing this correctly:

Option 1

Run a count query for a specific track:

db.tweets.count({track: 'apple'}) 

Given that a lot of data (potentially millions of documents) will be stored in the tweets collection, I wonder whether this will be slow?

Option 2

Create a second track_count collection and increment the count attribute every time a new tweet arrives:

 {track:'apple', count:0}
 {track:'orange', count:0}
 {track:'pear', count:0}

Then, when a new tweet appears:

 db.track_count.update( { track:"apple" }, { $inc: { count : 1 } } ); 

This keeps the counter for each track up to date, but it means writing to the database twice: once to save the tweet and again to increment the track's counter. Bear in mind there may be a fair number (tens, possibly hundreds) of tweets arriving per second.
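The double-write path of Option 2 can be sketched as a single handler per incoming tweet. Everything below is illustrative: the `onTweet` name is made up, and plain in-memory objects stand in for the tweets and track_count collections.

```javascript
// Sketch of Option 2's write path. In-memory stand-ins for the two
// collections (illustrative only; real code would call db.tweets.insert
// and db.track_count.update with $inc).
const tweets = [];      // stands in for db.tweets
const trackCount = {};  // stands in for db.track_count

function onTweet(tweet) {
  // Write 1: store the raw tweet document.
  tweets.push(tweet);
  // Write 2: increment the counter, like { $inc: { count: 1 } }.
  trackCount[tweet.track] = (trackCount[tweet.track] || 0) + 1;
}

onTweet({ track: 'apple', text: 'I like apples' });
onTweet({ track: 'apple', text: 'apple pie' });
onTweet({ track: 'pear',  text: 'pear cider' });

console.log(trackCount.apple); // 2
console.log(trackCount.pear);  // 1
```

In real Mongo code, passing an upsert flag to the counter update means the counter documents do not have to be created in advance.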

Does anyone have any suggestions on the best method for this?

+4
2 answers

Definitely use a separate track_count collection to maintain the running totals. Otherwise you will be re-querying your entire tweets collection every second, which will become very slow and expensive as the amount of data grows.

Do not worry about writing to the database twice, once to save the tweet and once to increment the counter. Writes in MongoDB are extremely fast, and this solution will scale well beyond thousands of tweets per second even on a single non-clustered Mongo instance.

+3

Does anyone have any suggestions on the best method for this?

There is no "best" method. This is a classic compromise. You can do "counters", you can endure slow requests, you can run regular tasks with a decrease in the number of cards.

  • Two writes => faster queries, more write activity
  • One write => slower queries, less write activity
  • Hourly map-reduce => slightly stale data, slightly more stored records
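The hourly map-reduce option works with a pair of functions in the shape MongoDB's `mapReduce` expects. The snippet below defines such a pair and then simulates Mongo's emit/group/reduce loop locally so the logic can be checked without a server; `runMapReduce` is a made-up helper for illustration.

```javascript
// map/reduce functions in the shape db.tweets.mapReduce expects.
// Mongo supplies `emit` at run time; here we provide it ourselves.
function map() {
  emit(this.track, 1);  // one vote per tweet for its track
}
function reduce(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Local simulation of what db.tweets.mapReduce(map, reduce, ...) does:
// run map over each document, group emitted values by key, then reduce.
function runMapReduce(docs) {
  const grouped = {};
  globalThis.emit = (key, value) => {
    (grouped[key] = grouped[key] || []).push(value);
  };
  docs.forEach(doc => map.call(doc));  // `this` is the document in map()
  const out = {};
  for (const key of Object.keys(grouped)) out[key] = reduce(key, grouped[key]);
  return out;
}

const counts = runMapReduce([
  { track: 'apple' }, { track: 'apple' }, { track: 'orange' }
]);
console.log(counts); // { apple: 2, orange: 1 }
```

Run on a schedule (e.g. hourly via cron), this trades freshness for almost no extra work on the write path.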

The general recommendation is to use counters. MongoDB is generally very good at handling large loads, especially this kind of "increment a counter" workload.

You will not get more speed without sacrificing something: disk, RAM or CPU. So you will need to choose your trade-off based on your needs.


Side note: is the track name unique?

You can try the following:

 {_id:'orange', count:0}
 {_id:'pear', count:0}

Or for daily counting:

 {_id:'orange_20110528', count:0}
 {_id:'orange_20110529', count:0}
 {_id:'pear_20110529', count:0}
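For the daily variant, a small helper can build those _id values from a track name and a date. `dailyKey` is an illustrative name, not anything from MongoDB itself:

```javascript
// Build an _id like 'orange_20110529' from a track name and a date.
function dailyKey(track, date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');  // 1-12, zero-padded
  const d = String(date.getUTCDate()).padStart(2, '0');
  return `${track}_${y}${m}${d}`;
}

console.log(dailyKey('orange', new Date(Date.UTC(2011, 4, 29)))); // orange_20110529
```

Incrementing with an upsert then creates each day's document automatically on its first tweet, so no pre-seeding is needed.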
+1
