groupByKey:
Syntax:
```scala
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map { case (word, ones) => (word, ones.sum) }
```
groupByKey can cause out-of-disk (and out-of-memory) problems because every value for a key is shuffled over the network and collected on the reducer workers.
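A minimal sketch of what that means (the sample data is illustrative, not from the original): groupByKey ships every (word, 1) pair across the network, and summing only happens afterwards.

```scala
// Illustrative only: groupByKey gathers every value for a key into one Iterable,
// so all the 1s cross the network before any summing happens.
val pairs = sc.parallelize(Seq(("spark", 1), ("rdd", 1), ("spark", 1)))
pairs.groupByKey()                               // RDD[(String, Iterable[Int])]
  .map { case (word, ones) => (word, ones.sum) } // summing happens after the full shuffle
  .collect()
  .foreach(println)                              // (spark,2), (rdd,1)
```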
reduceByKey:
Syntax:
```scala
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
```
Data is combined within each partition, so only one output per key per partition is sent over the network. reduceByKey requires the merge function to combine all your values into another value of the exact same type.
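A minimal sketch of that map-side combining (sample data and partition count are illustrative): each partition first reduces its own (word, 1) pairs, and only the partial sums cross the network.

```scala
// Illustrative only: with 2 partitions, each partition pre-sums its own keys,
// so at most one (key, partialSum) pair per key per partition is shuffled.
val pairs = sc.parallelize(Seq(("spark", 1), ("rdd", 1), ("spark", 1)), numSlices = 2)
val counts = pairs.reduceByKey((x, y) => x + y) // merge function keeps the value type (Int)
counts.collect().foreach(println)               // (spark,2), (rdd,1)
```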
aggregateByKey:
Same as reduceByKey, but it also takes an initial (zero) value.
3 parameters as input:
i. Initial (zero) value
ii. Sequence operation: combines a value with the accumulator within a partition
iii. Combine operation: merges the accumulators across partitions
*Example:*

```scala
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
```
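The aggregation step itself is not shown above; a hedged sketch of how it could continue, counting the values per key (the helper names initialCount, addToCounts, and sumPartitionCounts are illustrative):

```scala
// Turn "key=value" strings into (key, value) pairs, then count the values per key.
val kv = data.map(_.split("=")).map(parts => (parts(0), parts(1)))

val initialCount = 0                                   // i.   initial (zero) value
val addToCounts = (acc: Int, v: String) => acc + 1     // ii.  sequence op, within a partition
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2 // iii. combine op, across partitions

val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
countByKey.collect().foreach(println)
```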
Output (number of values per key): bar -> 3, foo -> 5
combineByKey:
3 parameters as input:
- Create combiner (initial value): unlike aggregateByKey, we don't have to pass a constant; we can pass a function that returns the initial accumulator from the first value seen for a key.
- Merge value function: folds the next value into the accumulator within a partition.
- Merge combiners function: merges the accumulators produced by different partitions.
Example:

```scala
// Compute the per-key average of an RDD[(K, Int)]; the accumulator is (runningSum, count).
val result = rdd.combineByKey(
  (v: Int) => (v, 1),                                                              // create combiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),                           // merge value
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)   // merge combiners
).map { case (k, v) => (k, v._1 / v._2.toDouble) }
result.collect.foreach(println)
```
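For context, a hedged sketch of an input that makes this example runnable (the rdd contents are illustrative): result then holds the average value per key.

```scala
// Illustrative input: a pair RDD of (key, Int) for the combineByKey above.
val rdd = sc.parallelize(Seq(("foo", 1), ("foo", 3), ("bar", 4)))
// result.collect() would then contain (foo,2.0) and (bar,4.0)
```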
reduceByKey, aggregateByKey, and combineByKey are preferred over groupByKey.
Link: Avoid groupByKey