Spark Cassandra Connector keyBy and shuffle

I am trying to optimize my Spark job by avoiding shuffling as much as possible.

I am using cassandraTable to create an RDD.

The column names of the column family are dynamic, so the table is defined as follows:

CREATE TABLE "Profile" (
  key text,
  column1 text,
  value blob,
  PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='ALL' AND
  ...

This definition results in CassandraRow RDD elements in the following format:

CassandraRow <key, column1, value>
  • key is the row key
  • column1 is the name of the dynamic column
  • value is the value of the dynamic column

So, if I have the row key 'profile1' with the columns name = 'George' and age = '34', the resulting RDD will be:

CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
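
For reference, this is how those three fields can be read back with the connector's Scala API. A minimal sketch, assuming a SparkContext sc already configured for the cluster and a keyspace literally named "keyspace":

import com.datastax.spark.connector._

// Pull the three physical columns out of each CassandraRow.
val entries = sc.cassandraTable("keyspace", "Profile").map { row =>
  (row.getString("key"),     // row key, e.g. "profile1"
   row.getString("column1"), // dynamic column name, e.g. "name"
   row.getBytes("value"))    // dynamic column value (blob -> java.nio.ByteBuffer)
}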

Then I need to group the elements that share the same key to get a PairRdd:

PairRdd<String, Iterable<CassandraRow>>

Since all the rows that share the same key live in the same Cassandra partition (and therefore on the same node), I would expect this grouping to happen locally, without moving data across the network.

As far as I understand, both groupBy and groupByKey cause a shuffle. This is what I am doing today, even though all rows for a given key already sit on a single node:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// Group all rows (one per dynamic column) of a profile under its row key.
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
        .cassandraTable(ks, "Profile")
        .groupBy(new Function<CassandraRow, String>() {
            @Override
            public String call(CassandraRow row) throws Exception {
                return row.getString("key");
            }
        });

My questions are:

  • Does keyBy on the RDD cause a shuffle, or does it keep the data local? (One way to check is sketched below.)
  • Is there another way to achieve this grouping without a shuffle? I thought about mapPartitions, but could not get it to work.
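
One way to answer the shuffle question empirically is to print the RDD lineage: a ShuffledRDD entry marks a stage boundary. A minimal sketch in Scala, assuming a SparkContext sc with the connector on the classpath (the keyspace name is a placeholder):

import com.datastax.spark.connector._

// Print the lineage of the grouped RDD; a ShuffledRDD entry in the
// output means the grouping moves data across the network.
val grouped = sc.cassandraTable("keyspace", "Profile")
  .groupBy(row => row.getString("key"))
println(grouped.toDebugString)

// Same check for keyBy: on its own it only wraps each element in a
// (key, element) pair, so no ShuffledRDD should appear here.
val keyed = sc.cassandraTable("keyspace", "Profile")
  .keyBy(row => row.getString("key"))
println(keyed.toDebugString)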

Thanks,


I think you are looking for spanByKey. Since Cassandra stores all rows of a partition together, and the connector reads them back in order, rows that share the same partition key arrive consecutively and can be grouped on the spot, without a shuffle.

For example:

// spanByKey groups consecutive rows that share the same key, so no shuffle is needed
sc.cassandraTable("keyspace", "Profile")
  .keyBy(row => row.getString("key"))
  .spanByKey
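
Building on that, here is a hedged sketch (same assumption of a connector-configured sc; the profiles name is mine) that collapses the spanned rows of each profile into a single name-to-value map:

import com.datastax.spark.connector._

// spanByKey yields RDD[(String, Seq[CassandraRow])]; collapse each
// profile's dynamic columns into one Map[columnName -> value].
val profiles = sc.cassandraTable("keyspace", "Profile")
  .keyBy(row => row.getString("key"))
  .spanByKey
  .map { case (key, rows) =>
    key -> rows.map(r => r.getString("column1") -> r.getBytes("value")).toMap
  }

Note that spanByKey only groups rows that arrive consecutively; it works here because the connector reads each Cassandra partition in one piece, but it is not a general replacement for groupByKey.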

Documentation:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key

