How to implement a queue-based workflow system?

I am working on a document management system. An example workflow would be something like this:

  • The document is sent by e-mail to the system.
  • The system performs a number of preparatory actions on the document.
  • The document is provided to the user for further processing.
  • The document is then sent to Quality Assurance.
  • Subsequently, the system performs a number of post-processing operations on the document.
  • The document is considered fully processed and is distributed (for example, e-mailed back to the person who originally sent it to the system).

Since the input volume will vary (but will usually be large), scalability is a major concern for me.

For example, let's say the system has already downloaded the e-mail attachments. If the attachments are PDF documents, the system should split each PDF into separate pages and then convert each page into thumbnails of several sizes, and so on. I plan to use a cron job (running, say, every minute) to check whether there are any PDF documents that need to be processed. Using a flag system (for example, "PDF is ready to be processed"), I can query the database for all PDFs that are flagged for processing. Once PDF processing has been completed, the flag can be updated to say "PDF processing completed."

However, since processing each PDF is very time-consuming, I am worried that the next cron run will also try to process the PDF files that the previous run is still working on.

A possible solution is to flag a PDF document as "PDF is currently being processed" as soon as a job picks it up. That way, the next cron run skips the documents that are already being processed.

Thus, each step in the workflow is likely to have three flags:

  • PDF is ready to be processed
  • PDF is currently being processed
  • PDF processing completed

The same for QA:

  • Document is ready for QA
  • Document is currently in QA
  • Document QA completed

Is this a good approach? Is there a better one? Should I store these flags in a single column of the PDF document table in the database, or should the flags have their own table (for example, if several flags can be set on a document at once)?
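
To make the two storage options concrete, here is one possible layout, offered as a sketch rather than a recommendation: a single `status` column holds the current state, and a separate, optional `status_history` table records every change. All table, column, and status names are hypothetical, and SQLite stands in for whatever database the system uses.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'pdf_ready'
           CHECK (status IN ('pdf_ready', 'pdf_processing', 'pdf_done',
                             'qa_ready', 'qa_in_progress', 'qa_done'))
);
CREATE TABLE status_history (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    status      TEXT NOT NULL,
    changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def set_status(conn, doc_id, new_status):
    """Update the current status and append a history row in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE documents SET status = ? WHERE id = ?", (new_status, doc_id)
        )
        conn.execute(
            "INSERT INTO status_history (document_id, status) VALUES (?, ?)",
            (doc_id, new_status),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO documents (id) VALUES (1)")
conn.commit()
set_status(conn, 1, "pdf_processing")
```

With this layout, "which documents are ready for QA?" stays a simple query on one column, while the history table answers "what happened to this document, and when?"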

I would appreciate any suggestions on how to implement such a system.

2 answers

To solve the problem of the same document being processed in parallel, there are many scheduler packages that can help you manage this. Quartz (http://www.quartz-scheduler.org/) is one I have used with great success.

To solve your problem, I would have three states: received, queued, and processed (similar to what you propose).

I would have a scheduled recurring task that checks the database for received PDFs and, for each one, puts a job on the queue for processing and marks the PDF as queued. If you make sure this happens in one transaction and use optimistic locking, there is no risk that another job will come along and re-read the PDF as received.
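
A minimal sketch of that polling step, assuming a hypothetical `documents` table with a `version` column used for optimistic locking. It uses Python with SQLite and an in-memory queue for illustration, not Quartz, and all function, table, and column names are invented for the example.

```python
import queue
import sqlite3


def mark_queued(conn, doc_id, expected_version):
    """Move a document to 'queued' only if nobody changed it since we read it."""
    with conn:  # one transaction per document
        cur = conn.execute(
            "UPDATE documents SET status = 'queued', version = version + 1 "
            "WHERE id = ? AND version = ?",
            (doc_id, expected_version),
        )
    # rowcount is 0 when the version moved on, i.e. another job got there first.
    return cur.rowcount == 1


def enqueue_received(conn, jobs):
    """One run of the scheduled task: queue every document still 'received'."""
    rows = conn.execute(
        "SELECT id, version FROM documents WHERE status = 'received' ORDER BY id"
    ).fetchall()
    enqueued = []
    for doc_id, version in rows:
        if mark_queued(conn, doc_id, version):
            jobs.put(doc_id)
            enqueued.append(doc_id)
    return enqueued


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
    "version INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO documents (id, status) VALUES (?, ?)",
    [(1, "received"), (2, "received"), (3, "processed")],
)
conn.commit()

jobs = queue.Queue()
first_run = enqueue_received(conn, jobs)   # queues documents 1 and 2
second_run = enqueue_received(conn, jobs)  # finds nothing left to queue
```

Note that the in-memory queue is not part of the database transaction. With a persistent queue, the enqueue and the status change would need to be made consistent with each other in some other way.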

Quartz uses a thread pool with a range of configuration options and is great for deferred, resource-intensive processing (I use it for image thumbnailing in a server setting).
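
The thread-pool idea can be illustrated without Quartz. The sketch below uses Python's standard `concurrent.futures` instead, and `make_thumbnail` is a hypothetical placeholder for the real conversion work.

```python
from concurrent.futures import ThreadPoolExecutor


def make_thumbnail(page_number, size):
    # Placeholder for the real, slow conversion; returns the output file name.
    return f"page{page_number}_{size}px.png"


def process_pages(page_count, sizes, max_workers=4):
    """Render every page at every size using a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(make_thumbnail, page, size)
            for page in range(1, page_count + 1)
            for size in sizes
        ]
        # Collect results in submission order; result() re-raises worker errors.
        return [future.result() for future in futures]


thumbnails = process_pages(page_count=2, sizes=[64, 128])
```

The pool size caps how many conversions run at once, which is the knob that keeps a burst of large PDFs from overwhelming the server.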

Taking a step back, there are several large workflow packages in the Java world that can handle most of what you want to do, including deferred processing of PDF files. Take a look at jBPM or Drools Flow; both are great packages, if complex.

UPDATE: Drools Flow has merged with jBPM. For this particular problem it may be a bit like killing a mosquito with a bazooka, but it is a great workflow package.


The right kind of solution depends on the technologies you use to implement the system. Is the pre-processing/post-processing performed by the same software/language as the e-mail handling? And do they run in separate processes?

If you have distributed components, you could do much worse than to look into an AMQP solution such as RabbitMQ, as it will take care of queuing every job and make sure that only one of your consumers picks up each job (you would model each thumbnailing task as a separate job).

If, however, the entire system is implemented in one language and runs within a single process, there are several simpler options you can use:

  • Resque is a good solution for Ruby
  • For Java, a LinkedBlockingQueue will work
  • I'm sure C# has a way to create a job queue (disclaimer: I don't know anything about C#)
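
As a rough illustration of the single-process option, here is a blocking job queue with a few worker threads. It uses Python's `queue.Queue`, which plays a role similar to Java's LinkedBlockingQueue; `run_workers` and `handle` are hypothetical names, and this is a sketch rather than a production design.

```python
import queue
import threading


def run_workers(doc_ids, handle, worker_count=3):
    """Process each document id exactly once using a shared blocking queue."""
    jobs = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            doc_id = jobs.get()  # blocks until a job is available
            if doc_id is None:   # sentinel: no more work for this worker
                break
            outcome = handle(doc_id)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for doc_id in doc_ids:
        jobs.put(doc_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


processed = sorted(run_workers([1, 2, 3, 4, 5], lambda doc_id: doc_id * 10))
```

Because each `get()` hands a job to exactly one worker, no document is processed twice, which is the property the flag scheme in the question is trying to achieve by hand.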

Source: https://habr.com/ru/post/1314262/

