JoinWithCassandraTable gets much slower as table size increases

I am currently using this stack:

  • Cassandra 2.2 (multi-node)
  • Spark / Streaming 1.4.1
  • Spark-Cassandra-Connector 1.4.0-M3

I have a DStream[Ids] whose RDDs contain around 6000-7000 elements each. id is the partition key.

import com.datastax.spark.connector._           // SomeColumns
import com.datastax.spark.connector.streaming._ // joinWithCassandraTable on DStreams

val ids: DStream[Ids] = ...
ids.joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))

As tableName grows to, say, about 30k rows, the query takes much longer, and I have trouble staying under the batch-duration threshold. It performs about as badly as a massive IN clause, which I had already concluded was impractical.

Are there any more efficient ways to do this?

Update: Always remember to repartition your local RDDs with repartitionByCassandraReplica before joining with Cassandra, to ensure that each partition only talks to its local Cassandra node. In my case I also had to increase the number of partitions of the local RDD/DStream when joining, so that the tasks were spread evenly across the workers; see the sketch below.
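For reference, here is a minimal sketch of that pattern against Spark-Cassandra-Connector 1.4. The Ids case class, the partitionsPerHost value, and the joinLocally helper are assumptions for illustration, not from the original post:

import com.datastax.spark.connector._ // repartitionByCassandraReplica, joinWithCassandraTable, SomeColumns
import org.apache.spark.streaming.dstream.DStream

case class Ids(id: String) // assumed: "id" matches the table's partition key column

def joinLocally(ids: DStream[Ids], keyspace: String, tableName: String) =
  ids.transform { rdd =>
    rdd
      // Shuffle the keys so that each Spark partition holds only ids owned by
      // one Cassandra node; partitionsPerHost controls how many tasks per host.
      .repartitionByCassandraReplica(keyspace, tableName, partitionsPerHost = 10)
      // Each task now issues point lookups against its local replica instead
      // of scattering requests (or one huge IN clause) across the cluster.
      .joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
  }

Note that the repartition step is itself a shuffle on every batch, so it only pays off when the Cassandra reads dominate the batch time, as they do here.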

1 answer

Is the "id" the key of a partition in your table? If not, I think it should be, otherwise you can scan the table, which will work more slowly as the table grows.

Also, as you found, call repartitionByCassandraReplica() on the RDD before the join so that each partition only queries its local Cassandra node.

