Combine two streams in Apache Flink regardless of window time

I have two data streams that I want to combine. The problem is that one stream has a much higher frequency than the other, and there are times when one stream receives no events at all. Is it possible to take the last event from one stream and attach it to every upcoming event of the other stream?

The only solution I found is to use the join function, but it requires a common window in which the join is applied. That window is never reached when one of the streams receives no events.

Is it possible to apply the join function to every event arriving on either stream, keeping the last consumed event of the other stream in state and using it for the join?

Thanks in advance for the helpful tips!

1 answer

You want to use Flink's ConnectedStreams with a RichCoFlatMapFunction or a CoProcessFunction. Either one lets you keep managed state (i.e., the last element from the infrequently updated stream) and attach it to events from the faster stream. CoProcessFunction adds the ability to work with timers, which you should use to clear state for stale keys, if that is relevant.
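The logic of such a last-value join can be sketched without any Flink dependency. In the minimal sketch below, the class `LastValueJoin` and its methods `onSlow` and `onFast` are hypothetical names standing in for the two callbacks of a co-function (`flatMap1` and `flatMap2`), and the `HashMap` stands in for keyed managed state; in a real job Flink would hold that state per key and checkpoint it for you.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/**
 * Library-free sketch of a last-value join (not Flink API).
 * K = key, S = slow-stream event, F = fast-stream event, O = joined output.
 */
final class LastValueJoin<K, S, F, O> {
    // Stand-in for keyed managed state: the latest slow-stream event per key.
    private final Map<K, S> lastSlow = new HashMap<>();
    private final BiFunction<F, S, O> combine;

    LastValueJoin(BiFunction<F, S, O> combine) {
        this.combine = combine;
    }

    // Plays the role of the callback for the slow stream: only updates state.
    void onSlow(K key, S event) {
        lastSlow.put(key, event);
    }

    // Plays the role of the callback for the fast stream: emits one joined
    // record per event, provided a slow-stream event has been seen for the key.
    void onFast(K key, F event, List<O> out) {
        S latest = lastSlow.get(key);
        if (latest != null) {
            out.add(combine.apply(event, latest));
        }
    }

    public static void main(String[] args) {
        LastValueJoin<String, String, String, String> join =
                new LastValueJoin<String, String, String, String>(
                        (fast, slow) -> fast + "@" + slow);
        List<String> out = new ArrayList<>();
        join.onFast("EUR", "t1", out); // no slow event yet, nothing emitted
        join.onSlow("EUR", "r1");
        join.onFast("EUR", "t2", out);
        join.onFast("EUR", "t3", out);
        join.onSlow("EUR", "r2");
        join.onFast("EUR", "t4", out);
        System.out.println(out); // prints [t2@r1, t3@r1, t4@r2]
    }
}
```

Dropping fast events that arrive before any slow event is a choice made for this sketch; buffering them, or emitting them with a default value, would be equally valid depending on the use case.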

The Flink training site has an exercise on implementing a join like this: Low Latency Join.

Update: in Flink 1.5 (not yet released as of February 2018), the SQL library has an implementation of a non-windowed streaming join. It keeps records in Flink state using MapState<Long, Record>, where Long is the timestamp, and performs the join by iterating over these maps and comparing timestamps. Compared to the training example (linked above), this has the advantage of deserializing records only when they are needed.
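The idea of buffering records by timestamp and joining by comparing timestamps can be illustrated roughly as follows. This is not the Flink SQL implementation: `TimestampBufferJoin`, `onLeft`, `onRight`, and the `maxGap` join condition are all assumptions made for the sketch, plain `HashMap`s stand in for `MapState<Long, Record>`, and records are simplified to strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Library-free sketch of a timestamp-keyed buffer join (not Flink code). */
final class TimestampBufferJoin {
    // One buffer per input, keyed by timestamp; stands in for MapState<Long, Record>.
    private final Map<Long, List<String>> left = new HashMap<>();
    private final Map<Long, List<String>> right = new HashMap<>();
    // Hypothetical join condition: timestamps may differ by at most this much.
    private final long maxGap;

    TimestampBufferJoin(long maxGap) {
        this.maxGap = maxGap;
    }

    // Buffers a left record and returns the pairs it forms with buffered right records.
    List<String> onLeft(long ts, String record) {
        left.computeIfAbsent(ts, t -> new ArrayList<>()).add(record);
        return probe(ts, record, right, true);
    }

    // Buffers a right record and returns the pairs it forms with buffered left records.
    List<String> onRight(long ts, String record) {
        right.computeIfAbsent(ts, t -> new ArrayList<>()).add(record);
        return probe(ts, record, left, false);
    }

    // Iterates over the other side's buffer and compares timestamps (the map keys)
    // first; the buffered values are only touched for entries that match.
    private List<String> probe(long ts, String record,
                               Map<Long, List<String>> other, boolean recordIsLeft) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<String>> e : other.entrySet()) {
            if (Math.abs(e.getKey() - ts) <= maxGap) {
                for (String o : e.getValue()) {
                    out.add(recordIsLeft ? record + "|" + o : o + "|" + record);
                }
            }
        }
        return out;
    }
}
```

A real implementation would also have to expire old entries, for example with timers, so that the buffers do not grow without bound.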
