There was an error in your code originally:
    if self.chunk_count % 50 == 0:
        self.raw_tweets.insert(self.tweet_list)
        self.chunk_count = 0
You reset chunk_count, but you never reset tweet_list. So the second time through, you try to insert 100 records (the 50 new ones plus the 50 that were already sent to the database earlier). You've fixed that now, but you still see a difference in performance.
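For reference, here is a minimal sketch of what the corrected batching logic could look like (the names chunk_count, tweet_list and raw_tweets come from your code; the helper name and the exact structure are assumptions):

    def flush_batch(self):
        # Hypothetical helper: insert the buffered batch, then clear BOTH
        # the counter and the list so old tweets are not re-inserted.
        if self.chunk_count % 50 == 0 and self.tweet_list:
            self.raw_tweets.insert(self.tweet_list)
            self.tweet_list = []      # this reset was missing originally
            self.chunk_count = 0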
The whole batch-size issue turns out to be a red herring. I tried taking a large JSON file and loading it via Python versus loading it via mongoimport, and Python was always faster (even in safe mode; see below).
After looking at your code more carefully, I realized that the problem is that the streaming API is actually handing you the data in chunks. All you need to do is take those chunks and put them straight into the database (which is what mongoimport is doing). The extra work your Python does (splitting up the stream, appending to a list, and then periodically sending batches to Mongo) is probably the difference between what I see and what you see.
Try this snippet for your handle_data():
    def handle_data(self, data):
        try:
            string_buffer = StringIO(data)
            tweets = json.load(string_buffer)
        except Exception as ex:
            print "Exception occurred: %s" % str(ex)
        try:
            self.raw_tweets.insert(tweets)
        except Exception as ex:
            print "Exception occurred: %s" % str(ex)
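This assumes the usual imports at the top of your file (Python 2, to match the print syntax above), along the lines of:

    import json
    from StringIO import StringIO   # Python 2 module; in Python 3 it lives in io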
Note that your Python inserts are not running in "safe mode"; change that by adding the safe=True argument to your insert call. With that, any insert that fails will raise an exception, and your try/except will print the error describing the problem.
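As a rough sketch (the collection name raw_tweets is from your code, the rest is illustrative), the acknowledged insert would look something like:

    try:
        # safe=True makes the driver wait for the server's acknowledgement,
        # so a failing insert raises an exception instead of failing silently
        self.raw_tweets.insert(tweets, safe=True)
    except Exception as ex:
        print "Exception occurred: %s" % str(ex)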
It's not that expensive performance-wise, either: I'm testing right now, and after about five minutes the sizes of the two collections are 14120 and 14113.
Asya Kamsky