Spark Streaming Batch Size

I am new to Spark and Spark Streaming, and I am working with Twitter streaming data. My task is to process each tweet independently, for example counting the number of words in each tweet. From what I have read, each input batch generates one RDD in Spark Streaming. So if I set a batch interval of 2 seconds, the new RDD contains all the tweets received during those two seconds, and any transformation I apply will operate on the whole two seconds of data; it will not be possible to process individual tweets. Do I understand that correctly, or does each tweet form its own RDD? I am a bit confused...

1 answer

In one batch you have an RDD containing all the statuses that arrived during the 2-second interval. You can then process those statuses individually. Here is a quick example:

JavaDStream<Status> inputDStream = TwitterUtils.createStream(ctx, new OAuthAuthorization(builder.build()), filters);

inputDStream.foreachRDD(new Function2<JavaRDD<Status>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Status> status, Time time) throws Exception {
        List<Status> statuses = status.collect();
        for (Status st : statuses) {
            System.out.println("STATUS: " + st.getText() + " user: " + st.getUser().getId());
            // Process and store each status somewhere
        }
        return null;
    }
});

ctx.start();
ctx.awaitTermination();
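Since the question is about counting words per tweet, note that the per-status logic inside that loop is plain Java. A minimal sketch of such a helper (`countWords` is a hypothetical name, not part of the Spark or Twitter4J APIs):

```java
public class WordCount {
    // Count whitespace-separated words in a single tweet's text.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("hello spark streaming")); // prints 3
    }
}
```

You could call this for each `Status` in the loop above, or apply it per record with a transformation such as `status.map(st -> countWords(st.getText()))`, since transformations also run on each element of the RDD individually.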

I hope I understood your question correctly.

Zoran

