I am new to Spark and I want to implement a lambda architecture using Spark Streaming and Spark batch jobs.
I found the following article on the Internet:
http://blog.cloudera.com/blog/2014/08/building-lambda-architecture-with-spark-streaming/
This is good for some of my analyses, but I do not think this solution is feasible when I need to find distinct elements.
To find the distinct elements in a JavaRDD you can use distinct(). A DStream is a sequence of RDDs, so if you apply
transform((rdd) -> rdd.distinct())
to the DStream, distinct() will run on each RDD of the stream separately, so you get the distinct elements of each RDD, not of the whole DStream.
Written this way it may be a bit confusing, so let me clarify with an example:
I have the following elements:
Apple
Pear
Banana
Peach
Apple
Pear
In a batch application:
JavaRDD<String> elemsRDD = sc.textFile(exFilePath).distinct();
The resulting RDD will contain:
Apple
Pear
Banana
Peach
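For reference, here is a minimal self-contained sketch of that batch job (only a sketch: the local master and the value of exFilePath are placeholders):

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctBatch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DistinctBatch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        String exFilePath = "/path/to/elems.txt"; // placeholder input path
        JavaRDD<String> elemsRDD = sc.textFile(exFilePath).distinct();

        // prints Apple, Pear, Banana, Peach (in no particular order)
        List<String> distinctElems = elemsRDD.collect();
        distinctElems.forEach(System.out::println);

        sc.stop();
    }
}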
If I understand correctly, this should be the behaviour for the stream. Suppose we have a 1s batch interval and a 2s window:
First RDD:
Apple
Pear
Banana
Second RDD:
Peach
Apple
Pear
JavaDStream<String> elemsStream = (obtained from whatever source)
JavaDStream<String> childStream = elemsStream.transform((rdd) -> rdd.distinct());
childStream.foreachRDD(...);
will end up with 2 RDDs:
First:
Apple
Pear
Banana
Second:
Peach
Apple
Pear
This is fine at the RDD level, but not what I need at the DStream level.
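To make this concrete, here is roughly what the full streaming job looks like with the transform approach (again only a sketch, assuming the Spark 2.x Java API: the socket source, host, port and local master are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DistinctPerRddStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("DistinctPerRddStream").setMaster("local[2]");
        // 1s batch interval, as in the example above
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // placeholder source: a text stream read from a socket
        JavaDStream<String> elemsStream = jssc.socketTextStream("localhost", 9999);

        // distinct() runs separately on every RDD (micro-batch) of the stream
        JavaDStream<String> childStream = elemsStream.transform((rdd) -> rdd.distinct());

        childStream.foreachRDD(rdd -> rdd.collect().forEach(System.out::println));

        jssc.start();
        jssc.awaitTermination();
    }
}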
My solution for the streaming part was as follows:
JavaDStream<HashSet<String>> distinctElems = elemsStream
    .map((elem) -> {
        HashSet<String> htSet = new HashSet<String>();
        htSet.add(elem);
        return htSet;
    })
    .reduce((sp1, sp2) -> {
        sp1.addAll(sp2);
        return sp1;
    });
So the result is:
Apple
Pear
Banana
Peach
just like in batch mode. However, this solution adds maintenance overhead and carries a risk of errors caused by keeping duplicated code bases.
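For completeness, this is roughly how I read the result out of the stream (a sketch, assuming distinctElems is the JavaDStream<HashSet<String>> built above and that java.util.HashSet is imported):

// each RDD of distinctElems carries a single HashSet with the
// de-duplicated elements of that micro-batch
distinctElems.foreachRDD(rdd -> {
    for (HashSet<String> set : rdd.collect()) {
        set.forEach(System.out::println); // Apple, Pear, Banana, Peach
    }
});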
Is there a better way to achieve the same result while reusing as much of the batch-mode code as possible?
Thanks in advance.