I think it deserves an update since Cassandra 1.2 was released recently.
I have been using Cassandra in production for the past 18 months for social games.
My take is that you should use Cassandra for its strengths. With a good understanding of what it does and how it does it, you can decide which data model fits, or even determine whether another database solution would serve you better.
OrderedPartitioner is only useful if your application relies on key range queries, but with it you give up one of Cassandra's most powerful features: automatic scaling and load balancing. Instead of querying over ranges of row keys, try to implement the same functionality you need using ranges of column names within the same row. TL;DR: reads/writes will NOT be balanced across nodes if you use this.
RandomPartitioner (md5 hashing) and Murmur3Partitioner (murmur hashing, which is better and faster) are the way to go if you want to support big data and high access frequencies. The only thing you give up is key range queries. Everything in the same row still lives on the same node in the cluster, and you can use the comparator and column range queries. TL;DR: USE THEM FOR PROPER BALANCING; you won't be giving up anything essential.
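For instance, instead of scanning a range of row keys, you slice a range of column names within one row. A minimal sketch using pycassa (a Python Thrift client common in the Cassandra 1.2 era); the 'app' keyspace, the 'timeline' column family, and the timestamp-named columns are my own assumptions, not something from a real schema:

```python
import pycassa

pool = pycassa.ConnectionPool('app', server_list=['127.0.0.1:9160'])
# Hypothetical CF whose column names are timestamps, so a column
# slice on one row replaces a row-key range scan.
timeline = pycassa.ColumnFamily(pool, 'timeline')

# All events for user42 between two timestamps, up to 100 of them;
# this works under RandomPartitioner because it never scans row keys.
events = timeline.get('user42',
                      column_start='20130101000000',
                      column_finish='20130201000000',
                      column_count=100)
```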
Things you should know about Cassandra:
CASSANDRA IS EVENTUALLY CONSISTENT. Cassandra chose to trade Consistency for high Availability and Partition tolerance ( http://en.wikipedia.org/wiki/CAP_theorem ). BUT you can get consistency out of cassandra; it is all about the consistency level policy you use when you read and write. This is a pretty important and complicated topic when talking about using cassandra, and you can read more about it here: http://www.datastax.com/docs/1.2/dml/data_consistency .
As a rule of thumb (and to keep it simple), I read and write at QUORUM ConsistencyLevel (since in my applications reads happen at roughly the same frequency as writes). If your application is extremely write-heavy and reads are much less frequent, use write at ONE and read at ALL. Or, if your use case is the other way around (writes are much rarer than reads), you can try reading at ONE and writing at ALL. Using ANY as the consistency level for writes is not a great idea if consistency is what you are trying to achieve, since it only guarantees that the mutation has reached the cluster, not that it has actually been written anywhere. This is the only case where I have seen writes silently fail on cassandra.
These are simple rules of thumb that make it easier to get started with cassandra. To get as much consistency and performance as possible out of a production cluster, you should study this topic in depth and truly understand it yourself.
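To make those rules concrete, here is a minimal sketch with pycassa; the 'app' keyspace, the 'users' column family, and the keys/values are hypothetical placeholders:

```python
import pycassa
from pycassa import ConsistencyLevel

pool = pycassa.ConnectionPool('app', server_list=['127.0.0.1:9160'])
users = pycassa.ColumnFamily(pool, 'users')  # hypothetical CF

# Balanced read/write workload: QUORUM on both sides gives strong
# consistency, since R + W > N for a replication factor of N.
users.insert('user42', {'name': 'alice'},
             write_consistency_level=ConsistencyLevel.QUORUM)
row = users.get('user42',
                read_consistency_level=ConsistencyLevel.QUORUM)

# Write-heavy workload: cheap writes at ONE, expensive reads at ALL.
users.insert('user42', {'last_seen': '2013-01-15'},
             write_consistency_level=ConsistencyLevel.ONE)
row = users.get('user42',
                read_consistency_level=ConsistencyLevel.ALL)
```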
If you need a human-readable data model with complex relationships between entities (tables), then I don't think Cassandra is for you. MySQL, and perhaps NewSQL, may be more useful for your use case.
It's good to know, roughly speaking, how cassandra stores and reads data. Whenever you write (and a delete is actually a write of a tombstone value in cassandra), the system puts the new value and its timestamp in a new physical location.
When you read, cassandra tries to pull all the records for a given key/column_name location and returns the most recent one it can find (the one with the highest client-supplied timestamp). So the memory a node needs depends directly on its write frequency. A compaction process runs inside cassandra that takes care of cleaning up old mutations. Cassandra also has an internal cache that is updated on reads with a location's most recent value.
Merging/compacting the SSTables on disk (the data structures that hold the data) can be triggered by reads, but it's better not to rely on that. Cleaning up tombstones and expired columns (set via the "time to live" functionality) is a different mechanism, managed by the garbage collector (see the GC grace time settings for more details).
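To illustrate the last-write-wins and time-to-live behavior, a hedged sketch, again with pycassa; the 'sessions' column family and the values are my own illustration:

```python
import pycassa

pool = pycassa.ConnectionPool('app', server_list=['127.0.0.1:9160'])
sessions = pycassa.ColumnFamily(pool, 'sessions')  # hypothetical CF

# Two writes to the same key/column: each lands in a new physical
# location, and reads return the one with the highest timestamp.
sessions.insert('sess-1', {'state': 'created'})
sessions.insert('sess-1', {'state': 'active'})   # this one wins

# Expiring column: gone after 30 minutes; the expired column and its
# tombstone are cleaned up later by compaction, subject to the GC
# grace time settings mentioned above.
sessions.insert('sess-1', {'heartbeat': 'ok'}, ttl=1800)
```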
This leads me to the last point I want to make: make sure your writes and reads are balanced across your cluster!
Suppose all of your users need to update one location very often.
DO NOT MAP that theoretical single location to just one row key! That would make all of those writes land on only one node in your cluster. Even if it doesn't bring everything down (because you have rockstar sysops), it will at least badly cripple the cluster's performance.
My advice is to spread your writes over several different row keys, distributed across all the nodes of the cluster. To get back all the data for this single theoretical location, use a multi_get over all the bucket row keys.
Example:
I want to keep a list of all active http sessions (each identified by a uuid). Do not store them all in a single "sessions" row. What I use as the row keys on my 6-node cassandra cluster is: <bucket>_sessions, with bucket running from 0 to 15. Then a small 16-key multi_get retrieves all active sessions, or a simple get can still tell me whether one session is active (if I know its uuid, of course). If your cluster is much larger, you can use a hash function to generate more bucket keys.
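Roughly, here is a hedged sketch of that bucketing with pycassa; the 'active_sessions' column family and the md5-based bucket helper are my own illustration, not necessarily what the real setup used:

```python
import hashlib
import pycassa

N_BUCKETS = 16  # spread writes across the cluster; grow this for bigger clusters

pool = pycassa.ConnectionPool('app', server_list=['127.0.0.1:9160'])
active = pycassa.ColumnFamily(pool, 'active_sessions')  # hypothetical CF

def bucket_key(session_uuid):
    # Stable hash so every process maps a given uuid to the same bucket row.
    digest = hashlib.md5(session_uuid.encode('utf-8')).hexdigest()
    return '%d_sessions' % (int(digest, 16) % N_BUCKETS)

def mark_active(session_uuid):
    # One column per active session, keyed by its uuid.
    active.insert(bucket_key(session_uuid), {session_uuid: ''})

def is_active(session_uuid):
    # Simple get on the session's bucket row answers "is it active?".
    try:
        active.get(bucket_key(session_uuid), columns=[session_uuid])
        return True
    except pycassa.NotFoundException:
        return False

def all_active_sessions():
    # The 16-key multi_get fans out across the nodes instead of
    # hammering a single row (and therefore a single replica set).
    keys = ['%d_sessions' % i for i in range(N_BUCKETS)]
    rows = active.multiget(keys)
    return [uuid for cols in rows.values() for uuid in cols]
```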