I am currently using this stack:
- Cassandra 2.2 (multi-node)
- Spark / Streaming 1.4.1
- Spark-Cassandra-Connector 1.4.0-M3
I have a DStream[Ids] whose RDDs contain around 6000-7000 elements each; id is the partition key.
val ids: DStream[Ids] = ...
ids.joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
As tableName grows to around 30k rows, the query takes much longer, and I have trouble staying under the batch duration. It performs about as badly as a massive IN clause, which I had already ruled out as impractical.
Are there any more efficient ways to do this?
Answer: Always repartition local RDDs with repartitionByCassandraReplica before joining with Cassandra, so that each Spark partition only talks to its local Cassandra node. In my case I also had to increase the number of partitions when joining a local RDD/DStream, so that tasks were distributed evenly across the workers.
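A minimal sketch of what that looks like for the DStream above. This assumes a case class Ids(id: Int) matching the table's partition key; repartitionByCassandraReplica is an RDD method, so it is applied inside foreachRDD (transform would work the same way). The partitionsPerHost value is illustrative and worth tuning:

```scala
import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical key type; its field must match the table's partition key column "id".
case class Ids(id: Int)

val ids: DStream[Ids] = ??? // your existing stream

ids.foreachRDD { rdd =>
  rdd
    // Group keys into partitions whose preferred location is a Cassandra
    // replica owning them, so the join reads only from the local node.
    .repartitionByCassandraReplica(keyspace, tableName, partitionsPerHost = 10)
    // Then each task performs direct partition-key lookups instead of a
    // cluster-wide scatter of single-key queries.
    .joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
    .foreach { case (key, row) => /* process joined rows */ }
}
```

Raising partitionsPerHost spreads the lookups across more tasks per worker, which is what evened out the load in my case.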