I am trying to optimize my Spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create an RDD.
The column names of the column family are dynamic, so they are defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
- key - the row key
- column1 - the name of the dynamic column
- value - the value of the dynamic column
So, if I have RK = 'profile1' with the columns name = 'George' and age = '34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group the elements that share the same key to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
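To make the target shape concrete, here is a plain-Java sketch (no Spark; a hypothetical Row record stands in for CassandraRow) of the grouping I am after:

```java
import java.util.*;
import java.util.stream.*;

public class GroupByKeySketch {
    // Hypothetical stand-in for CassandraRow: (key, column1, value).
    public record Row(String key, String column1, String value) {}

    // Group rows by their row key -- the shape I want the PairRdd to have.
    public static Map<String, List<Row>> groupByKey(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(Row::key));
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("profile1", "name", "George"),
            new Row("profile1", "age", "34"),
            new Row("profile2", "name", "Anna"));
        Map<String, List<Row>> grouped = groupByKey(rows);
        System.out.println(grouped.get("profile1").size()); // two dynamic columns for profile1
    }
}
```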
As far as I know, rows that share the same row key are stored on the same Cassandra node (in the same partition), so in theory this grouping should be possible without shuffling data between nodes.
However, using groupBy or groupByKey causes a shuffle, even though all the rows with the same key already sit on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
        .cassandraTable(ks, "Profile")
        .groupBy(new Function<CassandraRow, String>() {
            @Override
            public String call(CassandraRow row) throws Exception {
                return row.getString("key");
            }
        });
My questions are:
- Does calling keyBy on the RDD cause a shuffle, or not?
- Is there a way to replace groupBy with something that avoids the shuffle? I have read about mapPartitions, but I did not quite understand how to use it.
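For reference, this is the partition-local grouping I imagine handing to mapPartitions, sketched in plain Java with a hypothetical Row record in place of CassandraRow. It assumes rows with the same key arrive contiguously within a Spark partition (which is how I understand the connector reads a Cassandra partition), so no shuffle would be needed:

```java
import java.util.*;

public class PartitionLocalGroup {
    // Hypothetical stand-in for CassandraRow: (key, column1, value).
    public record Row(String key, String column1, String value) {}
    public record Group(String key, List<Row> rows) {}

    // The body I would pass to mapPartitions: walk one partition's iterator
    // and emit a Group for each run of identical keys.
    // ASSUMPTION: rows sharing a key are contiguous within the partition.
    public static List<Group> groupPartition(Iterator<Row> it) {
        List<Group> out = new ArrayList<>();
        String currentKey = null;
        List<Row> buffer = new ArrayList<>();
        while (it.hasNext()) {
            Row r = it.next();
            if (!r.key().equals(currentKey)) {
                if (currentKey != null) out.add(new Group(currentKey, buffer));
                currentKey = r.key();
                buffer = new ArrayList<>();
            }
            buffer.add(r);
        }
        if (currentKey != null) out.add(new Group(currentKey, buffer));
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("profile1", "name", "George"),
            new Row("profile1", "age", "34"),
            new Row("profile2", "name", "Anna"));
        List<Group> groups = groupPartition(rows.iterator());
        System.out.println(groups.size()); // 2 groups: profile1, profile2
    }
}
```

If the contiguity assumption does not hold, the buffering would have to be replaced with a per-partition HashMap, but either way the work stays local to each partition.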
Thanks,