Spark Streaming Batch Size

I am new to Spark and Spark Streaming, and I am working with Twitter streaming data. My task is to process each tweet independently, for example counting the number of words in each tweet. From what I have read, each input batch generates one RDD in Spark Streaming. So if I set a batch interval of 2 seconds, the new RDD contains all the tweets received during those two seconds, and any transformation I apply will operate on the whole two seconds of data; it will not be possible to process individual tweets. Do I understand that correctly, or does each tweet form its own RDD? I am a bit confused...

1 answer

In one batch you have an RDD containing all the statuses that arrived during the 2-second interval. You can then process those statuses individually. Here is a quick example:

JavaDStream<Status> inputDStream = TwitterUtils.createStream(ctx, new OAuthAuthorization(builder.build()), filters);

inputDStream.foreachRDD(new Function2<JavaRDD<Status>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Status> status, Time time) throws Exception {
        List<Status> statuses = status.collect();
        for (Status st : statuses) {
            System.out.println("STATUS: " + st.getText() + " user: " + st.getUser().getId());
            // Process and store each status somewhere
        }
        return null;
    }
});

ctx.start();
ctx.awaitTermination();
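Since the question is about counting words per tweet, note that the per-status logic inside that loop is plain Java. A minimal sketch of such a helper (`countWords` is a hypothetical name, not part of the Spark or Twitter4J APIs):

```java
public class WordCount {
    // Count whitespace-separated words in a single tweet's text.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("hello spark streaming")); // prints 3
    }
}
```

You could call this for each `Status` in the loop above, or apply it per record with a transformation such as `status.map(st -> countWords(st.getText()))`, since transformations also run on each element of the RDD individually.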

I hope I understood your question correctly.

Zoran

