Cassandra low performance?

I need to choose between Cassandra and MongoDB (or another NoSQL database; I'm open to suggestions) for a project with a lot of inserts (1M / day). So I wrote a small test to measure write performance. Here is the code that inserts into Cassandra:

    import time
    import os
    import random
    import string
    import pycassa

    def get_random_string(string_length):
        return ''.join(random.choice(string.letters) for i in xrange(string_length))

    def connect():
        """Connect to a test database"""
        connection = pycassa.connect('test_keyspace', ['localhost:9160'])
        db = pycassa.ColumnFamily(connection,'foo')
        return db

    def random_insert(db):
        """Insert a record into the database. The record has the following format
        ID timestamp
        4 random strings
        3 random integers"""
        record = {}
        record['id'] = str(time.time())
        record['str1'] = get_random_string(64)
        record['str2'] = get_random_string(64)
        record['str3'] = get_random_string(64)
        record['str4'] = get_random_string(64)
        record['num1'] = str(random.randint(0, 100))
        record['num2'] = str(random.randint(0, 1000))
        record['num3'] = str(random.randint(0, 10000))
        db.insert(str(time.time()), record)

    if __name__ == "__main__":
        db = connect()
        start_time = time.time()
        for i in range(1000000):
            random_insert(db)
        end_time = time.time()
        print "Insert time: %lf " %(end_time - start_time)

The code that inserts into Mongo is the same except for the connect function:

    def connect():
        """Connect to a test database"""
        connection = pymongo.Connection('localhost', 27017)
        db = connection.test_insert
        return db.foo2

Results: ~1046 seconds to insert into Cassandra and ~437 seconds to finish in Mongo. Cassandra is supposed to be much faster than Mongo at inserting data. So what am I doing wrong?
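
For scale, the timings above can be converted into insert rates and compared with the 1M / day target. This is only a back-of-the-envelope sketch; the helper name inserts_per_second is made up for illustration, and it assumes the load is spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def inserts_per_second(count, seconds):
    """Average insert rate; float() keeps the division exact on Python 2 as well."""
    return float(count) / seconds

# Target rate, assuming 1M inserts spread evenly over one day (an assumption).
required = inserts_per_second(1000000, SECONDS_PER_DAY)
# Measured rates from the single-threaded test runs reported above.
cassandra = inserts_per_second(1000000, 1046)
mongo = inserts_per_second(1000000, 437)

print("required:  %.1f inserts/s" % required)   # 11.6
print("cassandra: %.1f inserts/s" % cassandra)  # 956.0
print("mongo:     %.1f inserts/s" % mongo)      # 2288.3
```

Real traffic is rarely uniform, so peak rates may be well above the daily average.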

+8
python mongodb cassandra nosql
5 answers

Cassandra has no equivalent of Mongo's unsafe mode. (We used to have one, but we took it out because it's just a bad idea.)

The other major problem is that you are doing single-threaded inserts. Cassandra is designed for high concurrency; you need to use a multithreaded test. See the chart at http://spyced.blogspot.com/2010/01/cassandra-05.html (the actual numbers are over a year out of date, but the principle still holds).

The Cassandra source distribution includes such a test in contrib/stress.

+12

If I'm not mistaken, Cassandra lets you specify whether an insert behaves like MongoDB's "safe mode". (I don't remember what this feature is called in Cassandra.)

In other words, Cassandra can be configured to write to disk and then return, unlike the default MongoDB configuration, which returns immediately after issuing an insert without knowing whether the insert succeeded. That simply means your application never waits for a pass/fail response from the server.

You can change this behavior by using safe mode in MongoDB, but that is known to have a big impact on performance. Turn on safe mode and you may see different results.

+4

You will only see the true power of Cassandra if you have multiple nodes running. Any node can serve a write request. Multithreading the client only floods the same instance with more requests, which stops helping past a certain point.

  • Check the Cassandra log for events that occur during your tests. Cassandra starts flushing to disk once the Memtable is full (this is configurable; make it large enough and you will only be dealing with RAM plus commit log writes). If a Memtable is flushed to disk during a test, the test will slow down. I don't know when MongoDB writes to disk.
+1

May I suggest taking a look at Membase? It is used in the same way as memcached and is fully distributed, so you can keep scaling write throughput simply by adding more servers and/or more RAM.

In this case, you will definitely want to go with the Moxi client to get the best performance. Take a look at our wiki, wiki.membase.org, for examples, and let me know if you need further instructions ... I'm happy to walk you through it, and I'm sure Membase can handle this load easily.

+1

Create a batch mutator to perform multiple insert, update, and delete operations using as few round trips as possible.

http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch

The batch mutator helped me cut insertion time by at least half.
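
To show the idea behind batching without needing a running cluster, here is a minimal sketch in plain Python. It is not pycassa's implementation: BatchWriter, fake_send and the queue_size value are hypothetical names chosen for illustration, and fake_send only records how many rows each simulated round trip carried. For the real API, see the pycassa page linked above.

```python
class BatchWriter(object):
    """Queue rows and hand them to send_fn in groups of at most queue_size."""

    def __init__(self, send_fn, queue_size=100):
        if queue_size < 1:
            raise ValueError("queue_size must be at least 1")
        self._send_fn = send_fn
        self._queue_size = queue_size
        self._pending = []

    def insert(self, key, columns):
        """Queue one row; flush automatically once the queue is full."""
        self._pending.append((key, columns))
        if len(self._pending) >= self._queue_size:
            self.send()

    def send(self):
        """Flush whatever is queued as a single round trip (no-op if empty)."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send_fn(batch)

# Stand-in for the network call (hypothetical): records the size of each batch.
round_trips = []

def fake_send(batch):
    round_trips.append(len(batch))

writer = BatchWriter(fake_send, queue_size=100)
for i in range(250):
    writer.insert(str(i), {"num1": str(i)})
writer.send()  # flush the final partial batch

print(round_trips)  # [100, 100, 50]
```

Here 250 rows go out in 3 round trips instead of 250, which is where the saving comes from. The final send() matters: without it the last partial batch would never be written.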

+1
