What are the disadvantages of a pure streaming architecture versus the Lambda architecture?

Disclaimer: I am not a real-time architecture expert; I would just like to share a couple of personal considerations, and I would appreciate whatever others can suggest or point out.

Suppose we would like to build a real-time analytics system. Following Nathan Marz's definition of the Lambda architecture, in order to serve the data we need a batch processing layer (e.g. Hadoop), which continuously recomputes views from the dataset of all the data, and a so-called speed layer (e.g. Storm), which continuously processes a subset of the views (made of the events that arrive after the last full recomputation of the batch layer). You query your system by merging the results of the two together.
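To make the merge step concrete, here is a minimal sketch (my own illustration, not taken from Marz's book) of how a query could combine the batch view with the speed-layer view; the two views are plain dictionaries standing in for the actual serving stores:

```python
def pageviews_for(url, batch_view, realtime_view):
    """Answer a query by summing the precomputed batch view with the
    speed-layer view, which only covers events received since the last
    full batch recomputation."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

# Example: the batch layer counted 1200 views up to its last run,
# the speed layer has seen 37 more since then.
print(pageviews_for("/home", {"/home": 1200}, {"/home": 37}))  # -> 1237
```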

The rationale behind this choice makes sense to me, as does its mix of software engineering and systems engineering observations. Having an ever-growing master dataset of immutable, timestamped data makes the system resilient to human errors in computing the views (if you make a mistake, you just fix it and recompute the views in the batch layer) and lets the system answer virtually any query that might come up in the future. Also, such a datastore only needs to support batch inserts and random reads, while the datastore for the speed / real-time part needs to support efficient random reads and random writes, which increases its complexity.

My objection / trigger for a discussion is that in certain scenarios this approach might be overkill. For the sake of discussion, suppose we make a couple of simplifications:

  • Let's assume that in our analytics system we can define in advance an immutable set of use cases / queries that the system needs to serve, and that they will not change in the future.
  • Let's assume that we have a limited amount of resources (engineering effort, infrastructure, etc.) to implement it, and that storing the entire master set of elementary events coming into our system, as opposed to pre-computed views / aggregates, may simply be too expensive.
  • Let's assume that we manage to successfully minimize the impact of human errors (...).

The system would still need to be scalable and handle ever-growing traffic and data. Given these assumptions, I would like to know what would stop us from designing a fully stream-oriented architecture. What I picture is an architecture in which events (e.g. pageviews) are pushed into a stream, which could be RabbitMQ + Storm or Amazon Kinesis, and where the consumers of such streams directly update the required views through random writes / updates to a NoSQL database (e.g. MongoDB). A rough sketch of such a consumer follows below.
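As an illustration only, this sketch reads pageview events from a single Kinesis shard and upserts per-URL counters into MongoDB. The stream name, region, database and collection names, and the event format ({"url": ..., "ts": ...}) are assumptions made for the example; a real deployment would typically run one worker per shard (e.g. via the Kinesis Client Library) rather than this simple loop.

```python
import json
import time

import boto3
from pymongo import MongoClient

STREAM_NAME = "pageviews"  # hypothetical stream name

kinesis = boto3.client("kinesis", region_name="us-east-1")
views = MongoClient("mongodb://localhost:27017")["analytics"]["pageviews_per_url"]

# Read from the first shard only, for simplicity.
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        event = json.loads(record["Data"])  # assumed format: {"url": "/home", "ts": "..."}
        # Random write: increment the per-URL counter in the view.
        views.update_one({"_id": event["url"]}, {"$inc": {"count": 1}}, upsert=True)
    iterator = out["NextShardIterator"]
    time.sleep(1)
```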

To a first approximation, it seems to me that such an architecture can scale horizontally. Storm can be clustered, and with Kinesis the expected throughput can also be provisioned upfront. More incoming events simply mean more stream consumers, and since they are completely independent, nothing stops us from adding more. As for the database, sharding it with a proper policy lets us distribute an increasing number of writes across an increasing number of shards. To avoid read bottlenecks, each shard can have one or more read replicas. In terms of reliability, Kinesis promises to durably store your messages for up to 24 hours, and a distributed RabbitMQ (or any other queuing system of your choice) with proper use of acknowledgement mechanisms can probably satisfy the same requirement; a sketch of what I mean by that follows.
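For the RabbitMQ variant, the acknowledgement mechanism I am referring to could look roughly like this (queue name, prefetch value, and event format are arbitrary choices for the example): a message is only acked, and thus removed from the queue, after the corresponding view update has succeeded.

```python
import json

import pika
from pymongo import MongoClient

views = MongoClient("mongodb://localhost:27017")["analytics"]["pageviews_per_url"]

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="pageviews", durable=True)   # survive broker restarts

def handle(ch, method, properties, body):
    event = json.loads(body)
    views.update_one({"_id": event["url"]}, {"$inc": {"count": 1}}, upsert=True)
    # Ack only after the write succeeded; unacked messages get redelivered.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=100)                    # limit in-flight messages
channel.basic_consume(queue="pageviews", on_message_callback=handle)
channel.start_consuming()
```

Note that with redelivery a plain $inc is not idempotent, so a duplicate delivery would be double-counted; tracking a per-event id would be needed to get effectively-once updates.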

Amazon's Kinesis documentation deliberately (I believe) avoids locking developers into a particular architectural solution, but my overall impression is that they would like to push developers to simplify the Lambda architecture and arrive at a fully stream-based solution similar to the one I have outlined. To be slightly more compliant with the Lambda architecture requirements, nothing stops us from having, in parallel with the consumers constantly updating our views, a second set of consumers that process the incoming events and store them as atomic immutable units in a different datastore, which could later be used to produce new views (e.g. via Hadoop) or recompute faulty data. A sketch of such an archiving consumer is shown below.
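The archiving consumer could be as simple as the following sketch, which writes each raw event as an immutable object to S3 so it can later be replayed (e.g. by a Hadoop job) to rebuild views. The bucket name, key layout, and the ISO-formatted ts field are assumptions made for the illustration.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "raw-pageview-events"  # hypothetical bucket name

def archive(event):
    # Partition objects by day (assumes event["ts"] is an ISO-8601 string)
    # and use a unique id so events are never overwritten.
    key = f"{event['ts'][:10]}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
```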

What do you think of this reasoning? I would like to know in which scenarios a purely streaming architecture would fail to scale, and whether you have any other observations on the pros / cons of the Lambda architecture versus a streaming architecture.
