Distributed Worker Architecture

We are building a web application that distributes tasks across multiple geographic sites. The application should be able to:

  • create a task,
  • queue it,
  • assign it to a worker according to geographic criteria,
  • update the web interface as the task progresses (step 1, 2, 3, etc.),
  • save the final result in MongoDB and notify the web interface.

Jobs may run in parallel as long as they do not match the same geographic criteria.

We can delete a task if it is not in a processing state.

Our current stack: AngularJS, Node.js, MongoDB.

Our first idea was to have the remote workers poll the MongoDB task collection over HTTP. The problem is that we will have more than 20 remote workers and we want high-frequency status updates (<1 s). We believe this approach is easy to implement, but it would be hard to maintain, it would overload the database, and it depends heavily on network latency.

After some research on the Internet, we found documentation on RabbitMQ and message-queuing systems. This seems to meet most of our requirements, but I don't see how we could delete a specific task while it is still waiting in the queue, nor how we could easily handle task status updates.

We also found documentation on Redis, the in-memory key-value store. It would solve the problem of removing a specific task from the queue and would reduce the load on MongoDB, but we don't see how to notify a remote worker that it has work to do. If that comes down to HTTP polling, we lose all the benefits.

Our situation seems like a common problem, so I would like to know: what is the best solution?

+7
redis rabbitmq
6 answers

The architecture is interesting, and I think you can use RabbitMQ.

1. "create a task"

you can publish an AMQP message

2. "put it in the queue"

you can publish it to a queue, or better, to an exchange

3. "assign it to a worker according to geographic criteria"

you can use the Shovel plugin and route the task with a routing key. The Shovel plugin is designed to transfer messages reliably across slow or geographically distributed networks (WANs). See the sketch after step 5 for the routing-key idea.

4. "update the web interface as the task progresses (step 1, 2, 3, etc.)"

That's easy: you can forward the message to the web page through a WebSocket, or you can enable the Web-STOMP plugin ("rabbitmq-plugins enable rabbitmq_web_stomp") and use it from JavaScript directly to refresh the page.

5. "save the final result in MongoDB and notify the web interface"

Once you consume the final message, you can save the result in the database and push the notification to the interface.
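
As a rough sketch of steps 1-5 with the amqplib Node client: a topic exchange routes each task by region through its routing key. The exchange, queue, and key names ('tasks', 'task.us-west') and the task payload are illustrative assumptions:

    var amqp = require('amqplib');

    // Producer: create a task and publish it, routed by region
    amqp.connect('amqp://localhost').then(function (conn) {
      return conn.createChannel();
    }).then(function (ch) {
      ch.assertExchange('tasks', 'topic', {durable: true});
      var task = {taskId: 42, steps: 3};            // 1. create a task
      ch.publish('tasks', 'task.us-west',           // 2.+3. queue it on the exchange;
                 Buffer.from(JSON.stringify(task)), //        the routing key carries the region
                 {persistent: true});
    });

    // Worker in us-west: bind a queue to its region's key and consume
    amqp.connect('amqp://localhost').then(function (conn) {
      return conn.createChannel();
    }).then(function (ch) {
      ch.assertExchange('tasks', 'topic', {durable: true});
      ch.assertQueue('tasks.us-west', {durable: true});
      ch.bindQueue('tasks.us-west', 'tasks', 'task.us-west');
      ch.consume('tasks.us-west', function (msg) {
        var task = JSON.parse(msg.content.toString());
        // 4. publish status messages (step 1, 2, 3...) as the work progresses
        // 5. publish the final result; a consumer saves it to MongoDB and notifies the UI
        ch.ack(msg);
      });
    });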

The only tricky part is deleting a specific message: you could consume messages without acknowledging them, then send an ack only for the message you want to delete (the ack removes it from the queue). In any case, this is the wrong way to use RabbitMQ. I don't know your environment, but you could instead let waiting messages expire with a message TTL (http://www.rabbitmq.com/ttl.html).
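
For instance, reusing the channel from the sketch above, amqplib lets you set a per-message TTL; the 60-second value is an arbitrary assumption:

    // Expiration is given in milliseconds, as a string; the message is
    // dropped if it is still unconsumed after 60 s
    ch.publish('tasks', 'task.us-west',
               Buffer.from(JSON.stringify(task)),
               {persistent: true, expiration: '60000'});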

To update a task, you should not modify the message sitting in the queue; instead, publish another message carrying the update, and have your application update the task's status, for example in an internal list.

Hope this can be helpful.

+3

Redis

Redis is great because you can use it for more than job queuing, for example caching. I personally use Kue . I can't speak to the best solution for queuing jobs across data centers, though. While I don't know your circumstances, it is generally accepted that the data model should be centralized while the content is distributed. I run a service whose API is hosted in San Francisco, with CDN nodes in San Francisco and New York. My content is server templates, images, scripts, CSS, etc., all of which can be populated entirely from my API.
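
As a rough sketch of how Kue could cover the geographic criterion, you can encode the region in the job type; the 'task:us-west' type name and the payload are assumptions for illustration:

    var kue = require('kue');
    var queue = kue.createQueue(); // connects to a local Redis by default

    // Producer: create a job typed by region
    queue.create('task:us-west', {taskId: 42})
      .removeOnComplete(true)
      .save();

    // Worker deployed in us-west: only processes its region's job type
    queue.process('task:us-west', function (job, done) {
      job.progress(1, 3); // lets the web interface show step 1 of 3
      // ... do the work, save the result to MongoDB ...
      done();
    });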

Outsource

If you absolutely need this functionality, I personally recommend iron.io. They offer two services that could solve your problem. First, they offer an MQ system (IronMQ) behind a RESTful API, which is very easy to use and works great with Node. They also offer a Worker service (IronWorker) that lets you queue, schedule, and run tasks outside your stack. That would be limiting if you need to access resources inside your own cloud, in which case I would recommend IronMQ.

Insource

If you do not want to outsource your service and want to host the MQ yourself, I would not recommend RabbitMQ for a job queue. I would recommend something like beanstalkd , which is job-queue oriented, whereas RabbitMQ is a general-purpose message queue.

Additionally:

After reading some comments on the other answers, it seems to me that beanstalkd may be your best approach: it is built specifically for job queuing, whereas many other MQ systems are geared toward relaying updates and streaming new data through your cloud in real time, so you would have to implement your own job-queuing semantics on top of them. A sketch follows.
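
A minimal sketch of that idea with beanstalkd and the fivebeans Node client, using one tube per region; the tube name, payload, and timing numbers are assumptions:

    var fivebeans = require('fivebeans');
    var client = new fivebeans.client('127.0.0.1', 11300);

    client.on('connect', function () {
      client.use('tasks-us-west', function (err, tube) { // one tube per region
        var payload = JSON.stringify({taskId: 42});
        // priority 0, no delay, 60 s time-to-run
        client.put(0, 0, 60, payload, function (err, jobid) {
          // A job that is still "ready" (not yet reserved by a worker)
          // can simply be deleted, which matches the requirement:
          // client.destroy(jobid, function (err) {});
        });
      });
    }).connect();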

+10

RabbitMQ, Redis, and ZeroMQ are awesome, but you can pull this off with MongoDB alone. It has special collections called capped collections that support streaming via tailable cursors, and they are very fast and cheap for the database. Your workers (or another process) can listen to such a collection as a queue and then carry out the tasks.

For example, imagine you have one worker per region and each task document is tagged with its region. Then you just need an internal queue to process the updates in your main logic. We will use mongoose and async to show it:

    var mongoose = require('mongoose');
    var async = require('async');
    var TaskModel = mongoose.model('Task'); // assumes the Task schema is already registered

    // Internal queue that updates one task at a time
    var internalQueue = async.queue(function (doc, callback) {
      doc.status = 2; // we update the status of the task
      doc.save(function (e) {
        // ...and we go on from here, doing whatever we want to do
        callback(e);
      });
    }, 1);

    TaskModel
      .find({
        status: 1,
        region: 'KH' // unstarted tasks from Cambodia
      })
      .stream()
      .on('data', function (doc) {
        internalQueue.push(doc, function (e) {
          console.log('We finished the task: notify the web interface, save the result, or whatever');
        });
      });

You may not want to use mongoose or async, or you may want geo queries or more than one worker per region, but you can do all of this with the tools you already have: MongoDB and Node.js.

To get started with capped collections, just use createCollection from the MongoDB shell:

    db.createCollection('test', {capped: true, size: 100*1000, max: 100})

Just remember two things:

  • Data expires according to insertion order, not by time or last access to a document, so make your collection large enough.
  • You cannot delete documents from a capped collection, but you can update them in place (as long as the document does not grow), e.g. to flag a task as handled, as sketched below.
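
For instance, in the MongoDB shell you can mark a task as processed instead of removing it, provided the update keeps the document the same size; the status field and values are assumptions:

    // status was 1 (pending); overwriting with 2 (done) keeps the size, so the update is allowed
    db.test.update({_id: someId}, {$set: {status: 2}}) // someId is a placeholder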
+5

Where I work, we use Amazon SQS , and I can highly recommend it. It is cheap and reliable, it scales, and it saves you a ton of trouble (maintaining your own queuing system). We have workers in various Amazon regions around the world.

There is an aws-sdk for Node; see its documentation for the details.
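
A minimal sketch with the aws-sdk (v2) Node client, assuming one queue per region; the queue URL is a placeholder:

    var AWS = require('aws-sdk');
    var sqs = new AWS.SQS({region: 'us-west-2'});
    var queueUrl = 'https://sqs.us-west-2.amazonaws.com/123456789012/tasks-us-west'; // placeholder

    // Producer: enqueue a task
    sqs.sendMessage({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify({taskId: 42})
    }, function (err, data) {});

    // Worker: long-poll for work, then delete the message once done
    sqs.receiveMessage({
      QueueUrl: queueUrl,
      WaitTimeSeconds: 20 // long polling instead of hammering the API
    }, function (err, data) {
      if (err || !data.Messages) return;
      var msg = data.Messages[0];
      // ... do the work, save the result to MongoDB ...
      sqs.deleteMessage({QueueUrl: queueUrl, ReceiptHandle: msg.ReceiptHandle}, function () {});
    });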

+2

It's hard to give useful advice given the size of your question. But if it were me, I would use ZeroMQ , with a pattern such as Router-Req: keep the queue and all job-related data on the server, and hand tasks out only to workers that have announced they are ready, on the understanding that they start work on a task the moment they receive it and send back only the data the server needs to finalize the job. If you need the ability to interrupt work in progress, you can use a second pair of sockets for control messages, perhaps a Req-Rep connection.

The socket patterns are fully described in the linked guide, which explains them far better than I usefully can here, although it is a substantial read.
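
A minimal sketch of that Router-Req idea with the zeromq Node module (classic v5-style API); the port and message contents are assumptions:

    var zmq = require('zeromq');

    // Server: a ROUTER socket hands a task to each worker that announces readiness
    var router = zmq.socket('router');
    router.bindSync('tcp://*:5555');
    router.on('message', function (identity, delimiter, body) {
      if (body.toString() === 'ready') {
        // reply frames: [identity, empty delimiter, payload]
        router.send([identity, '', JSON.stringify({taskId: 42})]);
      } else {
        // otherwise it is a finished result: save it, notify the web interface, etc.
      }
    });

    // Worker: a REQ socket asks for work, does it, reports back
    var worker = zmq.socket('req');
    worker.connect('tcp://localhost:5555');
    worker.send('ready');
    worker.on('message', function (body) {
      var task = JSON.parse(body.toString());
      // ... do the work ...
      worker.send(JSON.stringify({taskId: task.taskId, status: 'done'}));
    });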

+1

The https://github.com/jkyberneees/distributed-eventemitter module makes distributed messaging simple by reusing the EventEmitter API on top of a STOMP broker. It could certainly help with the messaging side of your architecture.

0
