Some background before getting to the actual question:
I am working on a background application that consists of several different modules. Each module is currently a Java command-line application that runs "on demand" (more on this later).
Each module is a "step" in a larger process that you can think of as a data flow: the first step collects data files from an external source and loads them into some tables of the SQL database; the following steps, triggered by different conditions and events (time, availability of data in the database, messages and actions performed through a web service / web interface), read data from one or more database tables, process it, and write it to different tables. The steps run on three different servers and read from three different databases, but write to only one database. The goal is to aggregate data and to calculate indicators and statistics.
Currently, each module is executed periodically by a cron job (every few minutes / hours for the first modules, every few days for the last ones in the chain, which need to aggregate more data and therefore wait "longer" for it to become available). The module (currently a Java console application) is launched, checks the database for new, unprocessed information in a given datetime window, and does its job.
The problem: it works, but... I need to extend and maintain it, and this approach is starting to show its limits.
- I do not like relying on polling; it is wasteful, given that the modules earlier in the chain know enough to "tell" the modules further down the chain when the information they need is available and that they can proceed.
- It is "slow": the modules further down the chain run several days behind, because we need to be sure that the data has arrived and has been processed by the previous modules. So we "hold back" these modules until we are sure that we have all the data. New additions require real-time (not hard real-time, but "as soon as possible") calculation of some indicators. A good example is what happens here on SO with badges! :) I need to get something really similar.
To solve the second problem, I am going to introduce "partial" or "incremental" calculations: as soon as I have a set of relevant information, I process it. Then, when other related information arrives, I calculate the difference and update the data accordingly, but then I also need to notify the other (dependent) modules.
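To make the incremental idea concrete, here is a minimal sketch of what I have in mind, assuming the indicator is a plain mean. `IncrementalMean` and its listener callback are hypothetical names for illustration, not part of my current code, and the in-process callback only stands in for whatever notification mechanism I end up with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

// Hypothetical indicator: a running mean that is updated from deltas
// instead of being recomputed from all rows.
class IncrementalMean {
    private long count;
    private double sum;
    private final List<DoubleConsumer> listeners = new ArrayList<>();

    // Dependent modules register here. In the real system this would be
    // a message to another process, not an in-process callback.
    void onUpdate(DoubleConsumer listener) {
        listeners.add(listener);
    }

    // New data arrived: fold it in and notify dependents.
    void add(double value) {
        sum += value;
        count++;
        publish();
    }

    // A previously processed value changed: apply only the difference.
    void correct(double oldValue, double newValue) {
        sum += newValue - oldValue;
        publish();
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    private void publish() {
        double current = mean();
        for (DoubleConsumer listener : listeners) {
            listener.accept(current);
        }
    }
}
```

Every `add` or `correct` call touches only the running totals and then pushes the new value to whoever depends on it, which is the behaviour I would like to get across processes.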
Question(s)
- 1) What is the best way to do this?
- 2) Related: what is the best way to "notify" other modules (Java executables, in my case) that the relevant data is available?
I see three ways:
- add other "non-data" tables to the database, in which each module writes "Hey, I have done this, and it is available". When the cron job starts another module, it reads the table(s), decides that it can compute subset xxx, and does so. And so on
- use a message queue like ZeroMQ (or Apache Camel, as @mjn suggested) instead of DB tables
- use a key-value store like Redis instead of DB tables
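Whichever of the three I pick, I imagine hiding it behind a small interface so the modules do not care about the transport. The following is only a sketch under that assumption: `DataReady`, `NotificationChannel` and `InMemoryChannel` are hypothetical names, and the in-memory implementation is a single-JVM stand-in with no persistence, not a replacement for a table, an MQ or Redis.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical message: "module X finished producing data set Y".
final class DataReady {
    final String module;
    final String dataSet;

    DataReady(String module, String dataSet) {
        this.module = module;
        this.dataSet = dataSet;
    }
}

// Transport-neutral contract; a DB table, an MQ or Redis could sit behind it.
interface NotificationChannel {
    void publish(DataReady event);

    // Waits up to the timeout; returns null if nothing arrived
    // (or if the waiting thread was interrupted).
    DataReady await(long timeout, TimeUnit unit);
}

// In-memory stand-in, for illustration only: single JVM, nothing persisted.
class InMemoryChannel implements NotificationChannel {
    private final BlockingQueue<DataReady> queue = new LinkedBlockingQueue<>();

    @Override
    public void publish(DataReady event) {
        queue.add(event);
    }

    @Override
    public DataReady await(long timeout, TimeUnit unit) {
        try {
            return queue.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }
}
```

A downstream module would block in `await` instead of scanning tables for new rows, and the upstream module would call `publish` once its writes are committed.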
Edit: I am convinced that the queue-based approach is the way to go. I added the "table + polling" option for completeness, but now I understand that it is just a distraction (obviously, everyone is going to answer "yes, use queues, polling is evil", and rightly so!). So let me rephrase the question: what are the advantages / disadvantages of using an MQ versus a key-value store with pub/sub, like Redis?
- 3) Is there any solution that would let me get rid of cron completely?
Edit: in particular, this means: is there a mechanism in some MQ and/or key-value store that lets me publish messages with a delay? Something like "deliver this in 1 day"? With persistence and an "almost once" delivery guarantee, obviously.
- 4) Should I build this (event-based?) solution around a centralized service, running as a daemon / service on one of the servers?
- 5) Should I abandon the idea of starting the subscribers on demand, and have each module run continuously as a daemon / service?
- 6) What are the pros and cons (reliability, single point of failure, resource usage, complexity, ...)?
Edit: this is the part I care about most: I would like the queue itself to activate the modules based on the messages in the queue, similar to MSMQ activation. Is that a good idea? Is there anything in the Java world that does this, should I implement it myself (on top of an MQ or on top of Redis), or should I run each module as a daemon (even though some calculations usually happen in bursts: two hours of processing, then two days of inactivity)?
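To show what I mean by activation combined with delayed delivery, here is a rough in-process sketch built only on `java.util.concurrent.DelayQueue`. `Activation` and `ModuleActivator` are hypothetical names, the launchers are plain callbacks where a real version would probably spawn a JVM per module, and nothing here is persisted, so it gives no delivery guarantee across restarts. Whether an MQ or Redis can give me the durable equivalent is exactly what I am asking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical activation request that becomes visible only once it is due.
class Activation implements Delayed {
    final String module;
    final String payload;
    private final long dueAtNanos;

    Activation(String module, String payload, long delay, TimeUnit unit) {
        this.module = module;
        this.payload = payload;
        this.dueAtNanos = System.nanoTime() + unit.toNanos(delay);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(dueAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                            other.getDelay(TimeUnit.NANOSECONDS));
    }
}

// Hypothetical activator: the only always-on process; modules stay on demand.
class ModuleActivator {
    private final DelayQueue<Activation> inbox = new DelayQueue<>();
    private final Map<String, Consumer<String>> launchers = new ConcurrentHashMap<>();

    // A real launcher would likely start the module's JVM (e.g. via ProcessBuilder).
    void register(String module, Consumer<String> launcher) {
        launchers.put(module, launcher);
    }

    // "Deliver this message to that module after the given delay."
    void post(String module, String payload, long delay, TimeUnit unit) {
        inbox.put(new Activation(module, payload, delay, unit));
    }

    // Non-blocking: launch every activation that is already due.
    int dispatchDue() {
        int started = 0;
        Activation next;
        while ((next = inbox.poll()) != null) {
            started += launch(next);
        }
        return started;
    }

    // Daemon loop: sleep until the next activation is due, then launch it.
    void runForever() throws InterruptedException {
        while (true) {
            launch(inbox.take());
        }
    }

    int pending() {
        return inbox.size();
    }

    private int launch(Activation activation) {
        Consumer<String> launcher = launchers.get(activation.module);
        if (launcher == null) {
            return 0; // unknown module: this sketch simply drops the message
        }
        launcher.accept(activation.payload);
        return 1;
    }
}
```

With something like this, only the small activator would have to run continuously, while a module that works for two hours and then idles for two days would still be started on demand.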
NOTE: I cannot use heavyweight containers / EJB (no Glassfish or the like).
Edit: Camel also seems too heavy to me. I am looking for something really lightweight here, both in terms of resources and in terms of development complexity.