To make this clear, if you run the following, you will see “monkey” on the driver's stdout:
myDStream.foreachRDD { rdd => println("monkey") }
If you run the following, you will see “monkey” on the driver's stdout, and the filter work will be performed on whichever executors the rdd is distributed across:
myDStream.foreachRDD { rdd => println("monkey") rdd.filter(element => element == "Save me!") }
Add the simplification that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions we will call PartitionSetA, which live on MachineSetB, where ExecutorSetC runs. If you run the following, you will see “monkey” on the driver's stdout, you will see “turtle” on the stdout of every executor in ExecutorSetC (“turtle” will appear once for each partition; several partitions may reside on the machine where a given executor runs), and the work of both the filter and the addition operations will be performed across ExecutorSetC:
myDStream.foreachRDD { rdd => println("monkey") rdd.filter(element => element == "Save me!") rdd.foreachPartition { partition => println("turtle") val x = 1 + 1 } }
Another thing to note is that in the following code, y will be sent over the network from the driver to all of ExecutorSetC once for each rdd:
val y = 2

myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + y
  }
}
To avoid this overhead, you can use broadcast variables that pass the value from the driver to the executors only once. For instance:
val y = 2
val broadcastY = sc.broadcast(y)

myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + broadcastY.value
  }
}
To broadcast more complex things, such as objects that cannot easily be serialized once instantiated, see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html
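As a rough illustration of the idea behind that post (a minimal sketch only; ExpensiveResource, its lookup method, and "some-host" are hypothetical stand-ins, not part of any Spark API), you can broadcast a small serializable wrapper and let the non-serializable object be built lazily on each executor:

import org.apache.spark.broadcast.Broadcast

// Hypothetical non-serializable resource, e.g. a client holding an open connection.
class ExpensiveResource(host: String) {
  def lookup(key: String): String = s"$host -> $key" // placeholder logic
}

// Serializable wrapper: only the constructor argument travels over the network.
// The resource itself is created lazily, on first use, inside each executor JVM.
class LazyResource(host: String) extends Serializable {
  @transient lazy val resource: ExpensiveResource = new ExpensiveResource(host)
}

val broadcastResource: Broadcast[LazyResource] = sc.broadcast(new LazyResource("some-host"))

myDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Runs on an executor; ExpensiveResource is instantiated there, not on the driver.
    partition.foreach(element => println(broadcastResource.value.resource.lookup(element)))
  }
}

Because the resource field is @transient and lazy, it is never serialized with the wrapper and is constructed at most once per executor, which is the point of the pattern described in the post.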