I am new to Spark and I want to implement a lambda architecture using Spark Streaming and Spark batch jobs.
I found the following article on the Internet:
http://blog.cloudera.com/blog/2014/08/building-lambda-architecture-with-spark-streaming/
This is good for some of my analyses, but I do not think this solution is feasible when I need to find distinct elements.
To find the distinct elements in a JavaRDD you can use distinct(). A DStream is a sequence of RDDs, so if you apply
transform((rdd) -> rdd.distinct())
to the DStream, distinct() will run on each RDD of the stream separately, so you get the distinct elements of each RDD, not of the whole DStream.
Written this way it may be a bit confusing, so let me clarify with an example:
I have the following elements:
Apple
Pear
Banana
Peach
Apple
Pear
In a batch application:
JavaRDD<String> elemsRDD = sc.textFile(exFilePath).distinct();
The resulting RDD will contain:
Apple
Pear
Banana
Peach
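For reference, here is a minimal self-contained sketch of that batch job (only a sketch: the local master and the value of exFilePath are placeholders):

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctBatch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DistinctBatch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        String exFilePath = "/path/to/elems.txt"; // placeholder input path
        JavaRDD<String> elemsRDD = sc.textFile(exFilePath).distinct();

        // prints Apple, Pear, Banana, Peach (in no particular order)
        List<String> distinctElems = elemsRDD.collect();
        distinctElems.forEach(System.out::println);

        sc.stop();
    }
}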
If I understand correctly, this should be the behaviour for the stream. Suppose we have a 1s batch interval and a 2s window:
First RDD:
Apple
Pear
Banana
Second RDD:
Peach
Apple
Pear
JavaDStream<String> elemsStream = (obtained from whatever source)
JavaDStream<String> childStream = elemsStream.transform((rdd) -> rdd.distinct());
childStream.foreachRDD(...);
will end up with 2 RDDs:
First:
Apple
Pear
Banana
Second:
Peach
Apple
Pear
This is fine at the RDD level, but not what I need at the DStream level.
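To make this concrete, here is roughly what the full streaming job looks like with the transform approach (again only a sketch, assuming the Spark 2.x Java API: the socket source, host, port and local master are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DistinctPerRddStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("DistinctPerRddStream").setMaster("local[2]");
        // 1s batch interval, as in the example above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // placeholder source: a text stream read from a socket
        JavaDStream<String> elemsStream = jssc.socketTextStream("localhost", 9999);

        // distinct() runs separately on every RDD (micro-batch) of the stream
        JavaDStream<String> childStream = elemsStream.transform((rdd) -> rdd.distinct());

        childStream.foreachRDD(rdd -> rdd.collect().forEach(System.out::println));

        jssc.start();
        jssc.awaitTermination();
    }
}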
My solution for the streaming part was as follows:
JavaDStream<HashSet<String>> distinctElems = elemsStream
    .map((elem) -> {
        HashSet<String> htSet = new HashSet<String>();
        htSet.add(elem);
        return htSet;
    })
    .reduce((sp1, sp2) -> {
        sp1.addAll(sp2);
        return sp1;
    });
So the result is:
Apple
Pear
Banana
Peach
just like in batch mode. However, this solution adds maintenance overhead and carries a risk of errors caused by keeping duplicated code bases.
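For completeness, this is roughly how I read the result out of the stream (a sketch, assuming distinctElems is the JavaDStream<HashSet<String>> built above and that java.util.HashSet is imported):

// each RDD of distinctElems carries a single HashSet with the
// de-duplicated elements of that micro-batch
distinctElems.foreachRDD(rdd -> {
    for (HashSet<String> set : rdd.collect()) {
        set.forEach(System.out::println); // Apple, Pear, Banana, Peach
    }
});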
Is there a better way to achieve the same result while reusing as much of the batch-mode code as possible?
Thanks in advance.