A method for a self-reorganizing job queue

I have a job queue (using Amazon SQS) that hands out jobs to many machines, which fetch and process various documents over HTTP. The jobs touch hundreds of different hosts, and the tasks arrive in no predictable order.

To be polite, I do not want my system to repeatedly hammer the same host. So, if I get task number 123 to fetch something from example.com, but I see that I fetched something else from example.com within the past X seconds, I should move on to something else and save task number 123 for later.

The question is, what is a good way to implement this pattern?

It seems that the first step would be for the task runners to keep a list somewhere of all domains, along with the last time something on each domain was accessed. I suppose this could be a simple DB table.
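The bookkeeping described above can be sketched as follows. This is only a minimal in-memory illustration: the class name `LastAccessTable`, its methods, and the `min_interval` parameter are hypothetical, and a real cluster would need a shared store (such as the DB table mentioned above) with an atomic check-and-update instead of a local dictionary.

```python
import time


class LastAccessTable:
    """In-memory stand-in for a shared table of domain -> last access time."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval  # the "X seconds" between hits per domain
        self._clock = clock               # injectable so the logic can be tested
        self._last = {}                   # domain -> time of the last access

    def seconds_until_allowed(self, domain):
        """Return how long to wait before this domain may be accessed again."""
        last = self._last.get(domain)
        if last is None:
            return 0.0
        return max(0.0, last + self.min_interval - self._clock())

    def try_acquire(self, domain):
        """Record an access and return True, or return False if still cooling down."""
        if self.seconds_until_allowed(domain) > 0:
            return False
        self._last[domain] = self._clock()
        return True
```

A task runner would call `try_acquire("example.com")` before fetching, and delay or requeue the job whenever it returns `False`.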

There are many possible options for what to do if the message handler receives a job that should be delayed.

  • Just push a copy of the message onto the end of the queue and discard the original without executing it. Hopefully, by the next time it comes up, enough time will have passed. This can lead to a large number of redundant SQS messages, especially if a large cluster of tasks for the same domain comes through at once.

  • Sleep for however many seconds are needed until politeness allows the task to run. This can leave many queue processors doing nothing at the same time.

  • Accept the job, but save it in a local queue on each queue processor. I assume each processor could "claim" several jobs this way and then process them in whatever order achieves maximum politeness. This can still be unpredictable, since each queue processor needs to know about the domains being hit by all the others.

  • Set up separate queues for each domain and run one process dedicated to each queue. Each process would have to sleep for X seconds between tasks, so there is a lot of sleeping overhead, but maybe that is not so bad.

Do you have experience in developing these kinds of things? What strategy would you recommend?

+6
design-patterns parallel-processing perl amazon-sqs job-queue
2 answers

Use a separate task queue for each domain, plus a queue of domains.

Each processor must:

  • Take a domain from the head of the domain queue.
  • If the domain has not been accessed recently, take the top task from that domain's task queue.
  • Put the domain back at the end of the domain queue.
  • If we have a task to do, do it.
  • Sleep until it is time to check the head of the domain queue again, or until a domain is due for its next access.

It may help to organize the domain queue as a time-ordered priority queue: keep the domains in order of their next allowed access time.
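One possible reading of the steps above, with the domain queue kept as a time-ordered priority queue, is sketched below. It is a single-process illustration only: the `DomainScheduler` class and its method names are hypothetical, the `delay` parameter stands in for the X seconds of the question, and a distributed version would need shared storage in place of the in-memory heap.

```python
import heapq
import time
from collections import deque


class DomainScheduler:
    """Per-domain task queues plus a heap of domains ordered by next allowed time."""

    def __init__(self, delay, clock=time.monotonic):
        self.delay = delay
        self._clock = clock
        self._tasks = {}         # domain -> deque of pending tasks
        self._heap = []          # (next_allowed_time, domain) for domains with tasks
        self._next_allowed = {}  # domain -> earliest time of its next access

    def add_task(self, domain, task):
        if domain not in self._tasks:
            self._tasks[domain] = deque()
            # A domain we hit recently keeps its cooldown; a new one is due now.
            due = self._next_allowed.get(domain, self._clock())
            heapq.heappush(self._heap, (due, domain))
        self._tasks[domain].append(task)

    def pop_ready(self):
        """Return (domain, task) if the earliest domain is due, otherwise None."""
        if not self._heap or self._heap[0][0] > self._clock():
            return None
        _, domain = heapq.heappop(self._heap)
        queue = self._tasks[domain]
        task = queue.popleft()
        self._next_allowed[domain] = self._clock() + self.delay
        if queue:
            # More work remains: put the domain back with its new due time.
            heapq.heappush(self._heap, (self._next_allowed[domain], domain))
        else:
            del self._tasks[domain]
        return domain, task

    def seconds_until_next(self):
        """Return how long to sleep before the next domain is due (None if idle)."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - self._clock())
```

A processor loop would call `pop_ready()`, run the returned task if there is one, and otherwise sleep for `seconds_until_next()` seconds.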

+2

I would recommend setting up a queue for each domain and one processor per queue.

Most servers should have no problem with requests issued back to back in series, as long as you keep an eye on the total volume of data transferred (for example, you should avoid fetching files larger than a few hundred KB unless you have a real need for them).

I assume that you also obey robots.txt rules.
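As a rough illustration of checking robots.txt rules, Python's standard `urllib.robotparser` module can parse a robots.txt body and answer per-URL questions. The sample rules, the `build_policy` helper, and the `MyCrawler` user agent below are made up for the example; a real crawler would fetch each host's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def build_policy(robots_txt):
    """Parse the text of a robots.txt file into a queryable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(SAMPLE_ROBOTS_TXT)

# "MyCrawler" is a hypothetical user agent name.
allowed = policy.can_fetch("MyCrawler", "http://example.com/docs/page.html")
blocked = policy.can_fetch("MyCrawler", "http://example.com/private/data.html")
delay = policy.crawl_delay("MyCrawler")  # None when no Crawl-delay applies
```

If a host declares a crawl delay, it could serve as that domain's own value of X in place of a global default.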

0
