I am writing a test application that consumes messages from Kafka topics and then transfers the data to S3 and to RDBMS tables (the stream is similar to the one presented here: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html ). So I read the data from Kafka, and then:
- I want to save every message to S3
- some messages should be stored in table A of an external database (based on one filter condition)
- some other messages should be stored in table B of an external database (based on another filter condition)
So I do something like:
    Dataset<Row> df = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1,topic2,topic3")
        .option("startingOffsets", "earliest")
        .load()
        .select(from_json(col("value").cast("string"), schema, jsonOptions).alias("parsed_value"));
(Note that I read from more than one Kafka topic.) Then I define the required datasets:
    Dataset<Row> allMessages = df.select(.....);
    Dataset<Row> messagesOfType1 = df.select();
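Just to make the routing concrete: the type-specific dataset is essentially a filter on a field of the parsed JSON, roughly along these lines (the "type" column and its values are made up for this example, my real schema is different):

    // Illustration only: route messages by a hypothetical "type" field
    // inside the parsed JSON value.
    Dataset<Row> messagesOfType1 =
        df.filter(col("parsed_value.type").equalTo("type1"));
    Dataset<Row> messagesOfType2 =
        df.filter(col("parsed_value.type").equalTo("type2"));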
and now for each data set I create a query to start processing:
    StreamingQuery s3Query = allMessages
        .writeStream()
        .format("parquet")
        .option("startingOffsets", "latest")
        .option("path", "s3_location")
        .start();

    StreamingQuery firstQuery = messagesOfType1
        .writeStream()
        .foreach(new CustomForEachWriterType1())
        .start();
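Here CustomForEachWriterType1 is my own sink that writes to table A; it is roughly a ForeachWriter skeleton like the following (the actual JDBC details are left out):

    import org.apache.spark.sql.ForeachWriter;
    import org.apache.spark.sql.Row;

    // Skeleton of the custom sink that writes type-1 messages to table A.
    // The real JDBC connection handling is omitted here.
    public class CustomForEachWriterType1 extends ForeachWriter<Row> {

        @Override
        public boolean open(long partitionId, long version) {
            // open a JDBC connection for this partition/epoch;
            // returning false would skip processing this partition
            return true;
        }

        @Override
        public void process(Row row) {
            // INSERT the row into table A of the external database
        }

        @Override
        public void close(Throwable errorOrNull) {
            // close the JDBC connection
        }
    }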
Now I am wondering:
Will these queries be executed in parallel, or one after the other in FIFO order (so that I would have to assign them to separate scheduler pools)?
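If separate pools do turn out to be necessary, my assumption (not verified) is that it would look roughly like this: enable the FAIR scheduler via spark.scheduler.mode=FAIR and set the pool as a thread-local property before starting each query. The pool names below are made up for illustration:

    // Assumed approach: with spark.scheduler.mode=FAIR, the pool is taken
    // from a thread-local property set before start() is called.
    // Pool names "s3_pool" and "db_pool" are hypothetical.
    spark.sparkContext().setLocalProperty("spark.scheduler.pool", "s3_pool");
    StreamingQuery s3Query = allMessages
        .writeStream()
        .format("parquet")
        .option("path", "s3_location")
        .start();

    spark.sparkContext().setLocalProperty("spark.scheduler.pool", "db_pool");
    StreamingQuery firstQuery = messagesOfType1
        .writeStream()
        .foreach(new CustomForEachWriterType1())
        .start();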