How can I join a direct Spark stream with all the data collected by another stream throughout its lifecycle?

I have two Spark streams. The first receives product data: the supplier price, the currency, a description, and the supplier ID. These records are enriched with a category guessed from the description and with the price converted to dollars, and are then saved to a Parquet dataset.
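For reference, a minimal sketch of such a first pipeline in PySpark Structured Streaming follows. The Kafka source, topic name, schema, and the placeholder guess_category/to_usd UDFs are assumptions for illustration only, not details from this post:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("product-ingest").getOrCreate()

# Assumed product schema
product_schema = StructType([
    StructField("product_id", StringType()),
    StructField("supplier_id", StringType()),
    StructField("supplier_price", DoubleType()),
    StructField("currency", StringType()),
    StructField("description", StringType()),
])

# Placeholder enrichment logic; the real category model and FX conversion go here.
guess_category = F.udf(lambda desc: "unknown", StringType())
to_usd = F.udf(lambda price, currency: price, DoubleType())

products = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "products")                     # assumed topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), product_schema).alias("p"))
    .select("p.*"))

# Enrichment: guessed category plus price converted to USD.
enriched = (products
    .withColumn("category", guess_category(F.col("description")))
    .withColumn("price_usd", to_usd(F.col("supplier_price"), F.col("currency"))))

# Append every enriched product record to the Parquet history.
query = (enriched.writeStream
    .format("parquet")
    .option("path", "products.parquet")
    .option("checkpointLocation", "/tmp/checkpoints/products")
    .start())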

The second stream carries auction data for these products: the price at which each product was sold and the sale date.

Since a product can arrive in the first stream today and be sold a year later, how can I join the second stream with the entire history contained in the first stream's Parquet dataset?

As a result, it should be possible to see the average daily income over a price range ...

2 answers

I found a possible solution with SnappyData, using its mutable DataFrames:

https://www.snappydata.io/blog/how-mutable-dataframes-improve-join-performance-spark-sql

The example presented there is very similar to the one described by claudio-dalicandro.


If you use Structured Streaming in Spark, you can load the Parquet files written by the first stream into a static DataFrame:

parquetFileDF = spark.read.parquet("products.parquet")

Then you can read your second stream and join it with that DataFrame. Note that in a stream-static join the streaming DataFrame has to be on the left side, and only inner and left outer joins are supported against the static side:

streamingDF = spark.readStream. ...
joinedDF = streamingDF.join(parquetFileDF, "type", "left_outer")
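To make this concrete, here is a self-contained sketch of the stream-static join together with the daily-income aggregation the question asks for. The Kafka source, the auction schema, the product_id join key, and the price-range buckets are all assumptions for illustration, not details from the original post or this answer:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("auction-join").getOrCreate()

# Static side: the full product history written by the first stream.
products = spark.read.parquet("products.parquet")

# Assumed auction schema
auction_schema = StructType([
    StructField("product_id", StringType()),
    StructField("sale_price", DoubleType()),
    StructField("sale_date", TimestampType()),
])

auctions = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker address
    .option("subscribe", "auctions")                     # assumed topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), auction_schema).alias("a"))
    .select("a.*"))

# Stream-static join: streaming side on the left, inner or left outer only.
joined = auctions.join(products, "product_id", "left_outer")

# Average daily income bucketed by the product's USD price range (example buckets).
daily_income = (joined
    .withColumn("price_range",
                F.when(F.col("price_usd") < 100, "0-100")
                 .when(F.col("price_usd") < 500, "100-500")
                 .otherwise("500+"))
    .groupBy(F.to_date("sale_date").alias("day"), "price_range")
    .agg(F.avg("sale_price").alias("avg_daily_income")))

# Streaming aggregations need complete (or update) output mode.
query = (daily_income.writeStream
    .outputMode("complete")
    .format("console")
    .start())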

