I am working on a document management system. An example workflow would be something like this:
- The document is sent by e-mail to the system
- The system performs a number of preparatory actions for the document
- The document is provided to the user for further processing.
- The document is then sent to Quality Assurance.
- Subsequently, the system performs a number of post-processing operations on the document
- The document is considered fully processed and is distributed (for example, e-mailed back to the person who originally sent it to the system).
Since input volume will vary (but will usually be large), scalability is a major concern for me.
For example, let's say the system has already downloaded the email attachments. If the attachments are PDF documents, the system should split each PDF into separate pages, then convert each page into thumbnails of several sizes, and so on. I plan to run a cron job (say, every minute) that checks whether there are any PDF documents that need to be processed. Using a flag system (for example, "PDF is ready to be processed"), I can query the database for all PDFs that are marked for processing. Once PDF processing has completed, the flag can be updated to say "PDF processing completed."
However, since processing each PDF is very time consuming, I am worried that when the next cron job runs, it will also try to process the PDF files that the previous cron job is still working on.
A possible solution is to mark PDF documents immediately with "PDF is currently being processed." Thus, when the next cron job is executed, it excludes those that are already being processed.
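To make the idea concrete, here is a minimal sketch of how a cron job might claim documents, assuming a hypothetical `pdf_document` table with `status` and `claimed_by` columns (these names are mine, not from an existing schema). It uses SQLite only so that it is self-contained. The point is the conditional `UPDATE ... WHERE status = 'ready'`: a row counts as claimed only if that update actually changed it, so two overlapping jobs should not both pick up the same PDF.

```python
import sqlite3

# Hypothetical flag values; the real names are up to the schema.
READY, PROCESSING, DONE = "ready", "processing", "done"


def claim_ready_pdfs(conn, worker_id, limit=10):
    """Mark up to `limit` ready PDFs as being processed and return their ids."""
    claimed = []
    with conn:  # commits on success, rolls back on error
        rows = conn.execute(
            "SELECT id FROM pdf_document WHERE status = ? ORDER BY id LIMIT ?",
            (READY, limit),
        ).fetchall()
        for (doc_id,) in rows:
            # The extra "AND status = ?" makes the claim conditional: if another
            # job changed the row in the meantime, rowcount is 0 and we skip it.
            cur = conn.execute(
                "UPDATE pdf_document SET status = ?, claimed_by = ? "
                "WHERE id = ? AND status = ?",
                (PROCESSING, worker_id, doc_id, READY),
            )
            if cur.rowcount == 1:
                claimed.append(doc_id)
    return claimed


def make_demo_db():
    """Create an in-memory database with three PDFs waiting to be processed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pdf_document ("
        "id INTEGER PRIMARY KEY, status TEXT NOT NULL, claimed_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO pdf_document (status) VALUES (?)", [(READY,)] * 3
    )
    conn.commit()
    return conn
```

With this in place, a second cron run that starts while the first is still busy would simply find no rows in the "ready" state and exit.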
Thus, each step in the workflow is likely to have 3 flags:
- The PDF is ready to process.
- The PDF is currently being processed.
- PDF processing is completed.
The same for QA:
- Document ready for quality assurance
- Document is currently in QA
- Document QA is completed
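Since every step seems to follow the same ready / in progress / done pattern, one way I could model this is as a (step, state) pair rather than a separate set of flags per step. The sketch below is only an illustration; the step names in `STEPS` and the `advance` helper are hypothetical.

```python
from enum import Enum


class State(Enum):
    READY = "ready"
    IN_PROGRESS = "in_progress"
    DONE = "done"


# Hypothetical step names, in the order the workflow runs them.
STEPS = ["pdf_processing", "user_processing", "qa", "post_processing"]


def advance(step, state):
    """Return the (step, state) pair that follows the given one.

    Returns None once the last step is done.
    """
    if state is State.READY:
        return step, State.IN_PROGRESS
    if state is State.IN_PROGRESS:
        return step, State.DONE
    # The current step is done, so move to the next step if there is one.
    next_index = STEPS.index(step) + 1
    if next_index < len(STEPS):
        return STEPS[next_index], State.READY
    return None
```

For example, `advance("user_processing", State.DONE)` would hand the document over to `("qa", State.READY)`.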
Is this a good approach? Is there a better one? Should I store these flags in a single column of the PDF document table in the database? Or should the flags live in their own table (especially if several flags can be set on a document at the same time)?
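To show what I mean by the two options, here is a rough sketch of both layouts side by side, again using SQLite purely for illustration. The table and column names (`pdf_document.status`, `document_flag`) are placeholders I made up, not an existing schema.

```python
import sqlite3


def build_schema(conn):
    """Create both candidate layouts so they can be compared."""
    conn.executescript(
        """
        -- Option 1: a single status column, one flag per document at a time.
        CREATE TABLE pdf_document (
            id     INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'pdf_ready'
        );

        -- Option 2: a separate table, so a document can carry several flags.
        CREATE TABLE document_flag (
            document_id INTEGER NOT NULL REFERENCES pdf_document(id),
            flag        TEXT NOT NULL,
            set_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (document_id, flag)
        );
        """
    )


def flags_for(conn, document_id):
    """Return all flags currently set on a document under option 2."""
    rows = conn.execute(
        "SELECT flag FROM document_flag WHERE document_id = ? ORDER BY flag",
        (document_id,),
    ).fetchall()
    return [flag for (flag,) in rows]
```

With the single column, a cron job only has to filter on `status`; with the separate table, a document could be flagged as, say, both "pdf_done" and "qa_ready" at once, at the cost of a join.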
I would like to request suggestions on how to implement such a system.