Cassandra Efficiency for Long Rows

I am considering a column family (CF) design in Cassandra with very long rows (hundreds of thousands to millions of columns per row).

Using completely dummy data, I inserted 2 million columns into a single row (evenly distributed). When I do a slice operation to fetch 20 columns, I notice a huge degradation in performance the further into the row the slice is performed.

Across most of the row I can serve slice results in 10-40 ms, but performance falls off a cliff toward the end of the row: response times climb from 43 ms around the 1,800,000 mark to 214 ms at 1,900,000 and 435 ms at 1,999,900! (All slices are the same width.)
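For reference, the slices I am timing look roughly like the following (a minimal Hector sketch; the cluster, keyspace, column family, and row key names are made up for illustration, and the column names are the longs 0 through 1,999,999):

    import me.prettyprint.cassandra.serializers.LongSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.ColumnSlice;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.QueryResult;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class WideRowSliceTest {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("TestKeyspace", cluster);

            // Slice 20 columns out of the 2-million-column row, starting at a given offset.
            SliceQuery<String, Long, String> query = HFactory.createSliceQuery(
                    keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
            query.setColumnFamily("WideRow");
            query.setKey("row1");
            query.setRange(1800000L, null, false, 20); // start column, no end, forward, 20 columns

            long start = System.currentTimeMillis();
            QueryResult<ColumnSlice<Long, String>> result = query.execute();
            System.out.println(result.get().getColumns().size() + " columns in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }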

I am having trouble explaining why there is such a huge drop in performance toward the end of the row. Can anyone offer some insight into what Cassandra is doing internally to cause such a delay? Row caching is disabled, and pretty much everything else is the default Cassandra 1.0 configuration.

Cassandra is supposed to support up to 2 billion columns per row, but at this rate of degradation very long rows would not be usable in practice.

Many thanks.

Note: I am hitting this with 10 queries in parallel at a time, so they are somewhat slower than I would otherwise expect, but it is a fair test across all the queries, and even running them all serially the strange degradation between the 1,800,000th and 1,900,000th column remains.

I also noticed extremely poor performance when doing reversed slices for just a single element on rows with only 200,000 columns: query.setRange(end, start, false, 1);
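(For clarity, a reversed slice in Hector can also be expressed by setting the reversed flag to true; continuing the sketch above with the same keyspace and serializers, and illustrative names:)

    // Hedged sketch: fetch just the last (highest) column of the row by slicing in reverse.
    SliceQuery<String, Long, String> reversed = HFactory.createSliceQuery(
            keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
    reversed.setColumnFamily("WideRow");
    reversed.setKey("row1");
    reversed.setRange(null, null, true, 1); // no bounds, reversed=true, one column
    ColumnSlice<Long, String> slice = reversed.execute().get();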

2 answers

psanford's comment led me to the answer. It turns out that Cassandra < 1.1.0 (currently in beta) has slow slicing performance on long rows still held in memtables (not yet flushed to disk), but much better performance on the same data once it has been flushed to SSTables on disk.

See http://mail-archives.apache.org/mod_mbox/cassandra-user/201201.mbox/%3CCAA_K6YvZ=vd=Bjk6BaEg41_r1gfjFaa63uNSXQKxgeB-oq2e5A@mail.gmail.com%3E and https://issues.apache.org/jira/browse/CASSANDRA-3545.

In my case, the first 1.8 million columns had been flushed to disk, so slices in that range were fast, but the last 200,000 columns had not been flushed and were still in memtables. Since slicing memtables is slow for long rows, that is why I saw poor performance at the end of the row (my data was inserted in column order).

This can be worked around by manually invoking a flush on the Cassandra nodes. A patch was applied to 1.1.0 to fix this, and I can confirm that it resolves the problem for me.
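For anyone wanting to try the same workaround, the manual flush is just a nodetool call on each node (the keyspace and column family names here are placeholders):

    nodetool -h <cassandra_host> flush <keyspace> <column_family>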

I hope this helps someone else with the same problem.


A good resource on this is Aaron Morton's blog post on Cassandra Reversed Comparators. From the article:

Recall from my post on Cassandra Query Plans that once rows grow beyond a certain size they include an index of the columns, and that the entire index must be read whenever any part of it needs to be used, which is the case when a slice range specifies a start column or is reversed. So the fastest slice query to run against a row was one that retrieved the first X columns of the row by specifying only a column count.

If you are mostly reading from the end of the row (for example, if you are storing things by timestamp and mostly want to see the latest data), you can use a reversed comparator, which stores your columns in descending order. This will give you much better (and more consistent) query performance.
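As a concrete illustration (my own hedged sketch, not from the article, with illustrative names and assuming a Hector Keyspace handle as in the question's example):

    // Assumes the column family was created with a reversed comparator, e.g. in cassandra-cli:
    //   create column family Timeline with comparator = 'ReversedType(LongType)';
    // Columns are then stored newest-first, so the cheap "first N columns, count only"
    // slice returns the latest data without needing a start column or a reversed range.
    SliceQuery<String, Long, String> latest = HFactory.createSliceQuery(
            keyspace, StringSerializer.get(), LongSerializer.get(), StringSerializer.get());
    latest.setColumnFamily("Timeline");
    latest.setKey("sensor42");
    latest.setRange(null, null, false, 20); // first 20 columns == 20 newest under the reversed comparator
    List<HColumn<Long, String>> newest = latest.execute().get().getColumns();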

If your read patterns are more random, you might be better off splitting your data across multiple rows.
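One common way to do that split (again a hedged sketch of my own, not something from the answer) is to bucket the row key, for example by time window, so that no single row grows without bound:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class RowBucketing {
        private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyy-MM-dd");

        // e.g. "sensor42:2012-02-13" -- each day's columns land in their own row,
        // keeping rows short enough that slices stay fast wherever they start.
        static String bucketedRowKey(String baseKey, long timestampMillis) {
            return baseKey + ":" + DAY.format(new Date(timestampMillis));
        }

        public static void main(String[] args) {
            System.out.println(bucketedRowKey("sensor42", System.currentTimeMillis()));
        }
    }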

