I have a job queue (using Amazon SQS) that hands out jobs to many machines, which fetch and process various documents over HTTP. Hundreds of different hosts are accessed, and the tasks arrive in no predictable order.
To be polite, I do not want my system to hammer the same host repeatedly. So if I get task number 123 to fetch something from example.com, but I see that I have already fetched something else from example.com in the past X seconds, I should move on to something else and save task number 123 for later.
The question is, what is a good way to implement this pattern?
It seems that the first step would be for the task runners to keep a record somewhere of all the domains and the last time something was fetched from each one. I suppose this could be a simple DB table.
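A minimal sketch of that bookkeeping, in Python for illustration. The class name `DomainTracker`, its methods, and the in-memory dict are all assumptions made for the example; with many machines the dict would have to be replaced by a shared store such as the DB table mentioned above, with the check and the update done atomically.

```python
import time
from urllib.parse import urlparse


class DomainTracker:
    """Remembers when each domain was last fetched.

    Hypothetical in-memory stand-in for a shared table keyed by domain.
    """

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # the "X seconds" politeness gap
        self.clock = clock                # injectable so it can be tested
        self.last_fetch = {}              # domain -> time of last fetch

    def seconds_until_allowed(self, url):
        """Return how long to wait before this URL's domain may be hit again."""
        domain = urlparse(url).hostname
        last = self.last_fetch.get(domain)
        if last is None:
            return 0.0
        return max(0.0, last + self.min_interval - self.clock())

    def try_acquire(self, url):
        """Record a fetch and return True if the domain is free, else False."""
        if self.seconds_until_allowed(url) > 0:
            return False
        self.last_fetch[urlparse(url).hostname] = self.clock()
        return True
```

A task runner would call `try_acquire(url)` before fetching; a `False` result is the case where the job has to be delayed somehow.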
There are several possible options for what to do when a message processor receives a job that should be delayed.
Just push a copy of the message onto the end of the queue and discard the original without executing it. Hopefully, by the time it comes around again, enough time will have passed. This can lead to a large number of redundant SQS messages, especially if a large cluster of tasks for the same domain comes through at once.
Sleep for however many seconds are needed until the politeness rule says the task can run. This can leave many queue processors doing nothing at the same time.
Accept the job, but save it in a local queue somewhere on each queue processor. I assume that each processor could "claim" several jobs this way and then choose to process them in whatever order achieves maximum politeness. This could still be unpredictable, since each queue processor needs to know about the domains hit by all the others.
Set up a separate queue for each domain and run one process dedicated to each queue. Each process would have to sleep for X seconds between tasks, so there would be a lot of overhead from sleeping processes, but maybe that's not so bad.
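To make the local-queue option above more concrete, here is a rough single-process sketch in Python. The names `PoliteBuffer`, `pop_ready`, and `wait_time` are hypothetical, and the sketch deliberately ignores the cross-machine coordination problem that option raises.

```python
import time
from collections import deque


class PoliteBuffer:
    """Holds claimed jobs locally and releases them in a polite order."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # the "X seconds" politeness gap
        self.clock = clock
        self.pending = deque()            # (domain, job) in arrival order
        self.next_allowed = {}            # domain -> earliest next fetch time

    def add(self, domain, job):
        self.pending.append((domain, job))

    def pop_ready(self):
        """Return the oldest job whose domain is free, or None if all must wait."""
        now = self.clock()
        for i, (domain, job) in enumerate(self.pending):
            if self.next_allowed.get(domain, 0.0) <= now:
                del self.pending[i]
                self.next_allowed[domain] = now + self.min_interval
                return job
        return None

    def wait_time(self):
        """Seconds until some buffered job is runnable (0.0 if buffer is empty)."""
        now = self.clock()
        waits = [max(0.0, self.next_allowed.get(d, 0.0) - now)
                 for d, _ in self.pending]
        return min(waits, default=0.0)
```

A processor would only sleep (for `wait_time()` seconds) when `pop_ready()` returns `None`, which addresses the idle-worker concern of the sleep option. With SQS, buffered messages would also need to be processed and deleted before their visibility timeout expires, otherwise they become visible to other consumers again.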
Has anyone had experience designing this kind of thing? What strategy would you recommend?
design-patterns parallel-processing perl amazon-sqs job-queue
friedo