One approach to speeding up your database is denormalization. Take this MySQL example:
CREATE TABLE `users` (
  `user_id` INT NOT NULL AUTO_INCREMENT,
  …  -- Additional user data
  PRIMARY KEY (`user_id`)
);

CREATE TABLE `roles` (
  `role_id` INT NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(64),
  PRIMARY KEY (`role_id`)
);

CREATE TABLE `users_roles` (
  `user_id` INT NOT NULL,
  `role_id` INT NOT NULL,
  PRIMARY KEY (`user_id`, `role_id`)
);
Clean, neat, normalized. But if you want to fetch users along with their roles, the query gets more complicated:
SELECT u.*, r.*
FROM `users` u
LEFT JOIN `users_roles` ur ON u.`user_id` = ur.`user_id`
LEFT JOIN `roles` r ON ur.`role_id` = r.`role_id`;
If you denormalize it, it might look something like this:
CREATE TABLE `users` (
  `user_id` INT NOT NULL AUTO_INCREMENT,
  `role` VARCHAR(64),
  …  -- Additional user data
  PRIMARY KEY (`user_id`)
);
And the equivalent query:
SELECT * FROM `users`;
This improves some performance characteristics of your queries:
- Since the result you want is already in the table, you do not need to perform calculations at read time. For example, to count the users with a given role in the normalized schema, you need a GROUP BY and a COUNT. In the denormalized version, you could keep that number in a dedicated table that stores each role and the count of users who have it (see the sketch after this list).
- The data you want is all in one place, and hopefully in one place on disk. Instead of a lot of random reads, the query can be served with a few sequential reads.
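For instance, here is a minimal sketch of the role-count case, assuming a hypothetical `role_user_counts` table that the application keeps up to date at write time:

-- Normalized schema: counting users per role means a join and aggregation at read time
SELECT r.`name`, COUNT(*) AS `user_count`
FROM `roles` r
JOIN `users_roles` ur ON r.`role_id` = ur.`role_id`
GROUP BY r.`name`;

-- Denormalized alternative (hypothetical table, maintained on every role change)
CREATE TABLE `role_user_counts` (
  `role` VARCHAR(64) NOT NULL,
  `user_count` INT NOT NULL DEFAULT 0,
  PRIMARY KEY (`role`)
);

-- The read becomes a simple key lookup
SELECT `user_count` FROM `role_user_counts` WHERE `role` = 'admin';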
NoSQL databases are heavily optimized for exactly this case, where you want to read one big, mostly static dataset sequentially. At that point the database is essentially just moving bytes from disk to the network. Less work, less overhead, more speed. As simplistic as that sounds, you can model your data and your application so that this is the natural access pattern.
The tradeoff for this read performance is write load, disk space, and some application complexity. Denormalizing your data means keeping more copies of it, which means more disk space and more writes. Essentially, you end up with one dataset per query. And since you are shifting the cost of these calculations from read time to write time, you usually need some kind of asynchronous mechanism to keep the copies up to date, which is where the application complexity comes from. A minimal sketch of that write path follows.
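To illustrate the write-time cost, here is a sketch in plain MySQL, reusing the hypothetical `role_user_counts` table from above. In a real system the extra updates would typically happen asynchronously (for example via a queue or background job); they are shown in one transaction only to keep the sketch short.

-- Assigning the role 'admin' to user 42 now means updating every copy of that fact
START TRANSACTION;

UPDATE `users`
SET `role` = 'admin'
WHERE `user_id` = 42;

-- Keep the denormalized count in sync
-- (a complete version would also decrement the count for the user's previous role)
INSERT INTO `role_user_counts` (`role`, `user_count`)
VALUES ('admin', 1)
ON DUPLICATE KEY UPDATE `user_count` = `user_count` + 1;

COMMIT;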
And since you are storing more copies, you have to perform more write operations. That is why you can't really pull off this kind of architecture with a SQL database: writes are extremely hard to scale there.
In my experience, the tradeoff is well worth it at large scale. If you want to read a bit more about applying Cassandra in practice, I wrote this article a few months ago, and you may find it useful.