I am trying to optimize my Spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create an RDD.
The column names of the column family are dynamic, so they are defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
- key - the row key
- column1 - the name of the dynamic column
- value - the value of the dynamic column
So, if I have RK = 'profile1' with the columns name = 'George' and age = '34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group the elements that share the same key to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
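To make the target shape concrete, here is a plain-Java sketch (no Spark; a hypothetical Row record stands in for CassandraRow) of the grouping I am after:

```java
import java.util.*;
import java.util.stream.*;

public class GroupByKeySketch {
    // Hypothetical stand-in for CassandraRow: (key, column1, value).
    public record Row(String key, String column1, String value) {}

    // Group rows by their row key -- the shape I want the PairRdd to have.
    public static Map<String, List<Row>> groupByKey(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(Row::key));
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("profile1", "name", "George"),
            new Row("profile1", "age", "34"),
            new Row("profile2", "name", "Anna"));
        Map<String, List<Row>> grouped = groupByKey(rows);
        System.out.println(grouped.get("profile1").size()); // two dynamic columns for profile1
    }
}
```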
As far as I know, rows that share the same row key are stored on the same Cassandra node (in the same partition), so in theory this grouping should be possible without shuffling data between nodes.
However, using groupBy or groupByKey causes a shuffle, even though all the rows with the same key already sit on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
        .cassandraTable(ks, "Profile")
        .groupBy(new Function<CassandraRow, String>() {
            @Override
            public String call(CassandraRow row) throws Exception {
                return row.getString("key");
            }
        });
My questions are:
- Does calling keyBy on the RDD cause a shuffle, or not?
- Is there a way to replace groupBy with something that avoids the shuffle? I have read about mapPartitions, but I did not quite understand how to use it.
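For reference, this is the partition-local grouping I imagine handing to mapPartitions, sketched in plain Java with a hypothetical Row record in place of CassandraRow. It assumes rows with the same key arrive contiguously within a Spark partition (which is how I understand the connector reads a Cassandra partition), so no shuffle would be needed:

```java
import java.util.*;

public class PartitionLocalGroup {
    // Hypothetical stand-in for CassandraRow: (key, column1, value).
    public record Row(String key, String column1, String value) {}
    public record Group(String key, List<Row> rows) {}

    // The body I would pass to mapPartitions: walk one partition's iterator
    // and emit a Group for each run of identical keys.
    // ASSUMPTION: rows sharing a key are contiguous within the partition.
    public static List<Group> groupPartition(Iterator<Row> it) {
        List<Group> out = new ArrayList<>();
        String currentKey = null;
        List<Row> buffer = new ArrayList<>();
        while (it.hasNext()) {
            Row r = it.next();
            if (!r.key().equals(currentKey)) {
                if (currentKey != null) out.add(new Group(currentKey, buffer));
                currentKey = r.key();
                buffer = new ArrayList<>();
            }
            buffer.add(r);
        }
        if (currentKey != null) out.add(new Group(currentKey, buffer));
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("profile1", "name", "George"),
            new Row("profile1", "age", "34"),
            new Row("profile2", "name", "Anna"));
        List<Group> groups = groupPartition(rows.iterator());
        System.out.println(groups.size()); // 2 groups: profile1, profile2
    }
}
```

If the contiguity assumption does not hold, the buffering would have to be replaced with a per-partition HashMap, but either way the work stays local to each partition.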
Thanks,