I have implemented a solution that groups an RDD[K, V] by key and computes data for each group (K, RDD[V]) using partitionBy and a custom Partitioner. However, I am not sure it is really efficient, and I would like your point of view.
Here is an example: given a List[(K: Int, V: Int)], compute the mean of V for each group K, knowing that the computation must be distributed and that the number of values per key can be very large. The result should be:
List[(K, V)] => (K, mean(V))
Simple Partitioner Class:
import org.apache.spark.Partitioner

class MyPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions: Int = maxKey
  // Route each key to its own partition; a key >= maxKey falls
  // through the match and raises a MatchError.
  def getPartition(key: Any): Int = key match {
    case i: Int if i < maxKey => i
  }
}
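To check where each key actually lands, the placement can be inspected with mapPartitionsWithIndex (a small sanity check; it assumes the spark-shell SparkContext sc and the class above):

val check = sc.parallelize(List((1, 1), (2, 4), (3, 7)))
  .partitionBy(new MyPartitioner(4))
check
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.toList)))
  .collect()
  .foreach(println)
// (0,List()), (1,List((1,1))), (2,List((2,4))), (3,List((3,7)))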
Sample code:
val l = List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7))
val rdd = sc.parallelize(l)
val p = rdd.partitionBy(new MyPartitioner(4)).cache()

p.foreachPartition(x => {
  try {
    val r = sc.parallelize(x.toList) // nested RDD built from the partition's content
    val id = r.first()._1            // each partition holds a single key
    val v = r.map(_._2)
    println(id + "->" + v.mean())
  } catch {
    case _: UnsupportedOperationException => // empty partition: first() throws
  }
})
Output:
1->13, 2->4, 3->7
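For comparison, the same per-partition mean could be computed with plain Scala collections instead of a nested RDD; this is only a sketch, reusing p from above:

p.foreachPartition(it => {
  val xs = it.toList // materialize the partition locally
  if (xs.nonEmpty) {
    val mean = xs.map(_._2).sum.toDouble / xs.size
    println(xs.head._1 + "->" + mean)
  }
})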
My questions:
- What really happens when partitionBy is called? (Sorry, I could not find a detailed specification of it.)
- Is it really worthwhile to partition by key, knowing that in my production case there will be few keys (around 50) but very many values per key (around 1 million)?
- What is the cost of parallelize(x.toList)? Is it consistent with this approach? (I need an RDD as input for mean().)
- How would you do it yourselves? (An alternative I am weighing is sketched after this list.)
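For the last question, here is the kind of alternative I am comparing against: a per-key mean built with aggregateByKey, which accumulates a (sum, count) pair per key and needs neither the custom partitioner nor nested RDDs. A sketch, reusing the rdd defined above:

val means = rdd
  .aggregateByKey((0L, 0L))(
    (acc, v) => (acc._1 + v, acc._2 + 1),  // fold one value into (sum, count)
    (a, b) => (a._1 + b._1, a._2 + b._2))  // merge partial (sum, count) pairs
  .mapValues { case (sum, count) => sum.toDouble / count }
means.collect().foreach { case (k, m) => println(k + "->" + m) }
// prints 1->13.0, 2->4.5, 3->7.0

Since aggregateByKey combines map-side, with ~50 keys and ~1 million values per key only one small (sum, count) pair per key and partition would cross the network.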