MQ process, aggregate and publish data asynchronously

Question


Some background before getting to the actual question:

I am working on a back-end application made up of several different modules. Each module is, at present, a Java command-line application that runs "on demand" (more on this later).

Each module is a "step" in a larger process that you can think of as a data flow. The first step collects data files from an external source and loads them into some tables of a SQL database. The following steps, triggered by different conditions and events (time, availability of data in the database, messages and processing performed through a web service / web interface), take data from one or more database tables, process it, and write it to different tables. The steps run on three different servers and read data from three different databases, but write to only one database. The goal is to aggregate data and compute indicators and statistics.

Currently, each module is executed periodically by a cron job (every few minutes or hours for the first modules, every few days for the last ones in the chain, which need to aggregate more data and therefore wait longer for it to become available). The module (currently a Java console application) is launched, checks the database for new, unprocessed information in a given datetime window, and does its job.

The problem: it works, but I need to extend and maintain it, and this approach is starting to show its limits.

I do not like relying on polling; it is wasteful, given that the earlier modules already have enough information to tell the modules further down the chain when the data they need is available and they can proceed.
It is "slow": the modules further down the chain are delayed by several days, because we need to be sure the data has arrived and been processed by the earlier modules. So we hold these modules back until we are sure we have all the data. New additions require real-time (not hard real-time, but "as soon as possible") computation of some indicators. A good example is what happens here on SO with badges! :) I need something really similar.

To solve the second problem, I am going to introduce "partial" or "incremental" computations: as soon as I have a set of relevant information, I process it. Then, when other related information arrives, I compute the difference and update the data accordingly; but then I also need to notify the other (dependent) modules.
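A minimal sketch of the incremental idea, assuming the indicator is a simple running average per key. The names `IncrementalAverage`, `onUpdate` and `add` are hypothetical and not taken from the application; in the real pipeline the callback would publish a message for the dependent modules rather than call them in-process.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch: keeps a running average per key and tells
// dependent modules whenever a partial result changes.
class IncrementalAverage {
    // key -> {sum, count}
    private final Map<String, double[]> state = new HashMap<>();
    private final List<BiConsumer<String, Double>> listeners = new ArrayList<>();

    // Dependent modules register a callback instead of polling.
    void onUpdate(BiConsumer<String, Double> listener) {
        listeners.add(listener);
    }

    // Fold a newly arrived value into the existing partial result
    // and notify every listener with the updated average.
    double add(String key, double value) {
        double[] s = state.computeIfAbsent(key, k -> new double[2]);
        s[0] += value;
        s[1] += 1;
        double avg = s[0] / s[1];
        for (BiConsumer<String, Double> listener : listeners) {
            listener.accept(key, avg);
        }
        return avg;
    }
}
```

The point of the design is that nothing is recomputed from scratch: each arrival only touches the stored partial state, and the notification carries the already updated value.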

Question(s)

~~1) What is the best way to do this?~~
~~2) Related: what is the best way to "notify" other modules (Java executables, in my case) that the relevant data is available?~~

I see three ways:

add other "non-data" tables to the database, in which each module writes "Hey, I have done this, and it is available". When the cron job starts another module, it reads the table(s), decides that it can compute subset xxx, and does it. And so on
use message queues, like ZeroMQ (or Apache Camel, as @mjn suggested), instead of DB tables
use a key-value store, like Redis, instead of DB tables

Edit: I am convinced that a queue-based approach is the way to go. I added the "table + polling" option for completeness, but now I understand that it is just a distraction (obviously, everyone is going to answer "yes, use queues, polling is evil", and rightly so!). So let me rephrase the question: what are the advantages / disadvantages of using an MQ versus a key-value store with pub/sub, like Redis?

3) Is there any solution that will help me get rid of cron completely?

Edit: in particular, this means: is there a mechanism in some MQ and / or key-value store that lets me publish messages with a delay? Something like "deliver this in 1 day"? With persistence and an "almost once" delivery guarantee, obviously.

4) Should I build this (event-based?) solution around a centralized service, running it as a daemon / service on one of the servers?
5) Should I abandon the idea of starting subscribers on demand, and have each module run continuously as a daemon / service?
6) What are the pros and cons (reliability, single point of failure, resource usage, complexity ...)?

Edit: this is the bit I care most about: I would like the queue itself to activate the modules based on the messages in the queue, similar to MSMQ Activation. Is that a good idea? Is there anything in the Java world that does this, should I implement it myself (on top of an MQ or on top of Redis), or should I run each module as a daemon? (Even though some computations typically happen in bursts: two hours of processing, then two days of inactivity?)

NOTE: I cannot use heavyweight containers / EJB (no Glassfish or the like).

Edit: Camel also seems too heavy for me. I am looking for something really lightweight here, both in terms of resources and in terms of development complexity.

+7

java notifications redis message-queue dataflow

Lorenzo dematté Mar 07 '13 at 11:21


3 answers

The description of the queueing task partly sounds like what systems based on "enterprise integration patterns", such as Apache Camel, do.

A delayed message can be expressed with a constant

from("seda:b").delay(1000).to("mock:result");

or with a variable, e.g. a message header value

 from("seda:a").delay().header("MyDelay").to("mock:result");

+1

mjn Mar 07 '13 at 11:41


1> I suggest using a message queue. Which one to choose depends on your requirements, but in most cases any of them will do. I suggest picking a queue based on the JMS (ActiveMQ) or AMQP (RabbitMQ) protocol, and either writing a simple wrapper around it or using the ones provided by Spring: spring-jms or spring-amqp.

2> You can write the queue consumers in such a way that they notify your system when a new message arrives. For example, with RabbitMQ you can implement the MessageListener interface:

  public class MyListener implements MessageListener {
      @Override
      public void onMessage(Message message) {
          /* Handle the message */
      }
  }

3> If you use asynchronous consumers, as in <2>, you can get rid of all the polling and cron jobs.

4> It depends on your requirements. If you have millions of events / messages passing through your queue, then running the queue middle tier on a centralized server makes sense.

5> If resource consumption is not a problem, then keeping your consumers / subscribers running all the time is the easiest way. If the consumers are distributed, you can coordinate them with a service like ZooKeeper.

6> Scalability: most queuing systems make it easy to distribute messages, so provided your consumers are stateless, scaling is just a matter of adding new consumers and some configuration.

+1

winash Mar 15 '13 at 5:59


Lorenzo dematté · Accepted Answer · 2013-07-25T13:37:44+0000

Having implemented it, I feel that answering my own question may be useful for people who visit StackOverflow in the future.

In the end, I went with Redis. It is very fast and scalable. And I really like its flexibility: it is much more flexible than message queues. Am I claiming that Redis is better at MQ than the various MQs out there? Well, in my particular case, I think so. The point is that if something is not offered out of the box, you can build it (usually using MULTI, but you can even use Lua for more advanced customization!).

For example, I followed this good answer to implement a "persistent", recoverable pub/sub (i.e. a pub/sub that allows clients to die and reconnect without losing messages).
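The linked answer is not reproduced here. One common pattern, assumed for this sketch, is to give every subscriber its own list: the publisher appends each message to the list of every registered subscriber, and each subscriber pops from its own list, so messages published while a subscriber is down simply wait for it. On Redis the inbox would be a list (for example `RPUSH` to append, `LPOP` or `BLPOP` to consume); the in-memory model below uses JDK collections only so that it runs standalone. `DurablePubSub`, `register`, `publish` and `next` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory model of "pub/sub with one durable list per subscriber".
class DurablePubSub {
    // subscriber id -> pending messages, oldest first
    private final Map<String, Deque<String>> inboxes = new LinkedHashMap<>();

    // A subscriber registers once; its inbox survives disconnects.
    synchronized void register(String subscriberId) {
        inboxes.computeIfAbsent(subscriberId, id -> new ArrayDeque<>());
    }

    // Publishing appends to every registered inbox, connected or not.
    // Returns the number of inboxes that received the message.
    synchronized int publish(String message) {
        for (Deque<String> inbox : inboxes.values()) {
            inbox.addLast(message);
        }
        return inboxes.size();
    }

    // Called by a subscriber whenever it is (re)connected.
    // Returns null when nothing is pending or the id is unknown.
    synchronized String next(String subscriberId) {
        Deque<String> inbox = inboxes.get(subscriberId);
        return inbox == null ? null : inbox.pollFirst();
    }
}
```

A message popped this way is gone once it is returned, so a subscriber that crashes mid-processing loses it; guarding against that needs an extra "in progress" list, which this sketch leaves out.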

This helped me with both my scalability and my "reliability" requirements: I decided to keep every piece of the pipeline independent (each one is now a daemon), but to add a monitor that checks the lists / queues on Redis; if something is not being consumed (or is consumed too slowly), the monitor spawns a new consumer. I am also thinking of making it truly "resilient" by adding the ability for consumers to kill themselves when there is no work.
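A rough sketch of the monitor and the self-terminating consumer, assuming a simple rule of one consumer per fixed amount of backlog. The class `ElasticConsumers`, its method names and the thresholds are hypothetical; a `BlockingQueue` stands in for the Redis list so the example runs with the JDK alone.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ElasticConsumers {
    // Monitor rule: how many extra consumers to start for the current backlog,
    // aiming at one consumer per `perConsumer` pending items, capped at `max`.
    static int consumersToSpawn(int backlog, int running, int perConsumer, int max) {
        int wanted = (backlog + perConsumer - 1) / perConsumer; // ceiling division
        return Math.max(0, Math.min(wanted, max) - running);
    }

    // Consumer loop: keep processing until the queue has stayed empty
    // for idleMillis, then return so the thread (or process) can exit.
    static Runnable consumer(BlockingQueue<String> queue, long idleMillis,
                             AtomicInteger processed) {
        return () -> {
            try {
                while (queue.poll(idleMillis, TimeUnit.MILLISECONDS) != null) {
                    processed.incrementAndGet(); // real work would go here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
        };
    }
}
```

The monitor would call `consumersToSpawn` on a timer with the current list length and start that many consumers; each consumer then winds itself down after an idle period, which suits workloads that arrive in bursts.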

Another example: performing scheduled actions. I am following this approach, which is still quite popular. But I want to try keyspace notifications, to see whether a combination of expiring keys and notifications could be a better approach.
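The linked approach is not reproduced here. One widely used way to get delayed delivery on Redis, assumed for this sketch, is a sorted set whose score is the due time: `ZADD` to schedule, `ZRANGEBYSCORE` to find what is due, `ZREM` to claim it. The in-memory model below replaces the sorted set with a `TreeMap` so it runs with the JDK alone; `DelayedQueue`, `schedule` and `pollDue` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a Redis sorted set used as a delay queue.
class DelayedQueue {
    // due time (epoch millis) -> messages due at that time
    private final TreeMap<Long, List<String>> byDueTime = new TreeMap<>();

    // Similar in spirit to: ZADD delayed <dueAtMillis> <message>
    synchronized void schedule(String message, long dueAtMillis) {
        byDueTime.computeIfAbsent(dueAtMillis, t -> new ArrayList<>()).add(message);
    }

    // Similar in spirit to: ZRANGEBYSCORE delayed -inf <now>, then ZREM each hit.
    // Returns the due messages in due-time order and removes them.
    synchronized List<String> pollDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        Iterator<Map.Entry<Long, List<String>>> it =
                byDueTime.headMap(nowMillis, true).entrySet().iterator();
        while (it.hasNext()) {
            due.addAll(it.next().getValue());
            it.remove(); // claimed: will not be returned again
        }
        return due;
    }
}
```

A scheduler would call `pollDue(System.currentTimeMillis())` on a short timer and hand the results to the normal queue. Against a real Redis instance with several workers, the read and the removal are separate commands, so they would need to be made atomic (for instance with MULTI or a Lua script, as mentioned above) to avoid two workers claiming the same item.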

Finally, as the library for accessing Redis, I chose Jedis: it is popular, well supported, and provides a nice interface for implementing pub/sub as listeners. It is not the most idiomatic approach in Scala, but it works well.
