I am currently using this stack:
- Cassandra 2.2 (multi-node)
- Spark / Streaming 1.4.1
- Spark-Cassandra-Connector 1.4.0-M3
I have a DStream[Ids] whose RDDs contain around 6000-7000 elements each; id is the partition key.
val ids: DStream[Ids] = ...
ids.joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
As tableName grows to around 30k rows, the query takes much longer, and I have trouble staying under the batch duration. It performs about as badly as a massive IN clause, which I had already ruled out as impractical.
Are there any more efficient ways to do this?
Answer: Always repartition local RDDs with repartitionByCassandraReplica before joining with Cassandra, so that each Spark partition only talks to its local Cassandra node. In my case I also had to increase the number of partitions when joining a local RDD/DStream, so that tasks were distributed evenly across the workers.
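A minimal sketch of what that looks like for the DStream above. This assumes a case class Ids(id: Int) matching the table's partition key; repartitionByCassandraReplica is an RDD method, so it is applied inside foreachRDD (transform would work the same way). The partitionsPerHost value is illustrative and worth tuning:

```scala
import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical key type; its field must match the table's partition key column "id".
case class Ids(id: Int)

val ids: DStream[Ids] = ??? // your existing stream

ids.foreachRDD { rdd =>
  rdd
    // Group keys into partitions whose preferred location is a Cassandra
    // replica owning them, so the join reads only from the local node.
    .repartitionByCassandraReplica(keyspace, tableName, partitionsPerHost = 10)
    // Then each task performs direct partition-key lookups instead of a
    // cluster-wide scatter of single-key queries.
    .joinWithCassandraTable(keyspace, tableName, joinColumns = SomeColumns("id"))
    .foreach { case (key, row) => /* process joined rows */ }
}
```

Raising partitionsPerHost spreads the lookups across more tasks per worker, which is what evened out the load in my case.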