I am trying to understand how reduceByKey works in Spark, using Java as the programming language.
Say that I have the sentence "I am who I am." I break the sentence into words and store it as a list: [I, am, who, I, am].
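For reference, this is roughly how I build the words RDD (a minimal sketch; the SparkConf settings and the variable names sc and words are just my assumptions, not part of the original program):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// Split the sentence on whitespace and parallelize the resulting word list
JavaRDD<String> words = sc.parallelize(Arrays.asList("I am who I am".split(" ")));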
Now this function assigns a 1 to each word:
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        // Emit a (word, 1) pair for every word in the RDD
        return new Tuple2<String, Integer>(s, 1);
    }
});
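(As an aside, I believe the same mapping can be written more compactly with a Java 8 lambda; this is just a sketch, assuming Java 8 or later, and produces the same pairs:)

// Equivalent to the anonymous PairFunction above
JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));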
So, the output looks something like this:
(I,1)
(am,1)
(who,1)
(I,1)
(am,1)
Now, if I have 3 reducers, each reducer will receive a key together with the values associated with that key (I sketch my mental model of this step in plain Java after the list below):
reducer 1:
(I,1)
(I,1)
reducer 2:
(am,1)
(am,1)
reducer 3:
(who,1)
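My understanding of what each reducer then does with its values is the following (a plain-Java trace of the pairwise fold, not actual Spark code; the names valuesForI and acc are mine):

import java.util.Arrays;
import java.util.List;

// Values received for the key "I"
List<Integer> valuesForI = Arrays.asList(1, 1);

// Fold the values pairwise, the way I assume call(i1, i2) is invoked
int acc = valuesForI.get(0);
for (int i = 1; i < valuesForI.size(); i++) {
    acc = acc + valuesForI.get(i);   // call(acc, next) -> acc + next
}
System.out.println("(I," + acc + ")");   // prints (I,2)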
I wanted to know:
a. What exactly happens in the function below?
b. What are the parameters of new Function2<Integer, Integer, Integer>?
c. Basically, how is the JavaPairRDD formed?
import org.apache.spark.api.java.function.Function2;

JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        // Combine two counts for the same key into their sum
        return i1 + i2;
    }
});
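To inspect the result I print the pairs like this (a sketch; collect() brings everything to the driver, which is fine for a toy example like mine):

import java.util.List;
import scala.Tuple2;

List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<String, Integer> pair : output) {
    System.out.println(pair);   // expected pairs (in some order): (I,2), (am,2), (who,1)
}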