Depending on your use case, you may need to configure your event topics so that records never expire. There are several considerations here:
1) You can reprocess the source stream from a given Kafka topic only within the retention period of that topic. For example, if an event is published to the source topic at time t0 and the topic's retention period is t1, then this event can only be reprocessed until t0 + t1.
2) If your input messages are not independent, i.e. it is not an append-only stream of self-contained facts but the messages logically depend on each other (see https://kafka.apache.org/0110/documentation/streams/developer-guide#streams_kstream_ktable ), you cannot reprocess it from ANY given point in its history. For example, if the input stream is actually changelog-like (table-like), so that a user-creation event occurs at t0 and a user-update event occurs at t2, then starting processing at time t1, where t0 < t1 < t2, means starting from an "inconsistent" view of the original stream. So the correct approach is to configure the input source topic not with time-based retention, but with a key-based log compaction policy.
In other words, if your events are independent, reprocessing is quite simple; you can even let old segments expire. But if your messages are interdependent, you cannot let old records expire. For more information about retention, see this mailing list thread.
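As a minimal configuration sketch (assuming Kafka's Java AdminClient; the topic name user-events, partition count, and replication factor are placeholders), a compacted input topic could be created like this:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Keep the latest record per key instead of expiring records by time
            // ("user-events" is a placeholder topic name).
            NewTopic topic = new NewTopic("user-events", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```

With cleanup.policy=compact, Kafka retains at least the latest record per key indefinitely, so a full replay always reconstructs a consistent view of each key.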
How you write your processing logic, and the engine you use to reprocess with, also matter. Suppose you use Kafka Streams to process events into state that you query. If you want to be able to reprocess events to rebuild this state, in addition to processing events live as they arrive, your stream processing engine should be able to coordinate the replay of historical events so that they arrive in the same relative order as during live processing, and your processing logic therefore handles them the same way. Kafka Streams currently has limited capabilities in this department, so you may need to write a topology that does this coordination yourself. Apache Beam doesn't seem to offer much in this department either. The blog post mentions:
A common characteristic of archived data is that it can arrive out of order. Sharding of archived files often leads to a completely different processing order than events arriving in near real time. The data will also be available all at once, and therefore delivered instantaneously from the perspective of your pipeline. Whether running experiments on past data or reprocessing past results to fix a data processing bug, it is critical that your processing logic be applicable to archived events just as easily as to near real-time data.
... but the actual Beam documentation doesn't mention replay or historical processing at all. Therefore, I believe the emphasis in that paragraph should be on "it is critical that your processing logic be as easy to apply to archived events as to near real-time incoming data."
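Returning to the Kafka Streams scenario above, here is a minimal sketch of processing events into state that you query; the topic name user-events, the store name event-counts-store, and the simple per-key count are assumptions for illustration:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class QueryableStateSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // For a full historical run, start from the earliest offsets
        // (for an existing application, the application reset tool rewinds it).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG),
                  "earliest");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("user-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.as("event-counts-store"));   // queryable state store

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Wait until the instance is RUNNING before querying the local store.
        while (streams.state() != KafkaStreams.State.RUNNING) {
            Thread.sleep(100);
        }

        // The same store is queried the same way whether the state was built
        // from live traffic or from a replay of historical events.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "event-counts-store", QueryableStoreTypes.keyValueStore()));
        System.out.println("count = " + store.get("some-user-id"));
    }
}
```

Whether the state was built from live traffic or from a replay, the store and the query code are identical; what differs between the two modes is how the input offsets are positioned before the run.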
Thus, it makes sense to treat historical processing as the general case and real-time processing as a special case. To write logic that works in both processing modes:
- You should know which operations in your stream processor give deterministic results in both historical and live processing modes, and which do not. For example, with Kafka Streams, not all operations give the same results in real-time and historical processing.
- You may want your processing logic not to depend on wall-clock factors such as System.currentTimeMillis(), nor to generate random identifiers that will not be the same on each run (unless, of course, you want them to differ). A sketch of the event-time alternative follows after this list.
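For instance, here is a sketch using the Kafka Streams Processor API (the TimestampTaggingProcessor class and its tagging logic are hypothetical) showing how to rely on the record's own event timestamp rather than the wall clock:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Hypothetical processor that tags each value with the time it "happened".
public class TimestampTaggingProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        // record.timestamp() is the event time carried with the record itself,
        // not the wall-clock time at which this code happens to run, so a
        // historical replay sees the same value as the original live run.
        long eventTime = record.timestamp();
        context.forward(record.withValue(record.value() + "@" + eventTime));
    }
}
```

Because the timestamp travels with the event, reprocessing produces the same tagged values as the original live run, which System.currentTimeMillis() would not.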