Please recommend an alternative to Microsoft HPC

We want to implement a distributed system on a cluster that performs resource-intensive, image-based computations with heavy I/O against large storage volumes. It has the following characteristics:

  1. There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible.
  2. It is built around the concept of a job. A job can consist of one to 100,000 tasks.
  3. A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes.
  4. Tasks can create other tasks on the fly.
  5. Some tasks complete within minutes, while others may take many hours.
  6. Tasks are executed according to a dependency hierarchy that can be updated on the fly.
  7. A job can be paused and resumed later.
  8. Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager must take this into account when scheduling tasks.
  9. Tasks report their progress and results back to the manager.
  10. The manager knows whether tasks are pending or hung.
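
To make requirement 6 concrete: the tasks form a dependency DAG, and a task becomes runnable only once all of its parents have finished. A minimal sketch of that scheduling constraint (not tied to any particular product; names are illustrative only):

```python
from collections import deque

def runnable_order(deps):
    """deps maps task -> set of tasks it depends on; returns a valid run order."""
    indegree = {t: len(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)
    # Tasks with no unfinished parents are immediately runnable.
    ready = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for child in children[t]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

Any scheduler satisfying requirement 6 has to maintain essentially this invariant, even when edges are added to the DAG while jobs are running.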

We found that Windows HPC Server 2008 R2 (HPCS) is conceptually very close to what we need. However, it has several important drawbacks:

  • Job creation becomes exponentially slower as the number of tasks increases. Submitting more than a few thousand tasks takes an unbearable amount of time.
  • A task cannot report its progress to the manager; only the job can.
  • There is no way to communicate with a task while it is running, which makes it impossible to check whether the task is alive or needs to be restarted.
  • HPCS only knows nodes, CPU cores, and memory as resource units. We cannot introduce our own resource units (for example, free disk space, custom hardware devices, etc.).

Here's my question: does anyone know of and/or have experience with a distributed computing framework that could help us? We are using Windows.

+6
windows cluster-computing hpc distributed-computing
8 answers

I would look at the Condor high-throughput computing project. It supports Windows clients and servers (as well as Linux and OS X), handles complex dependencies between tasks using DAGMan, and can pause (and even migrate) jobs. I have experience with Condor-based systems that scale to thousands of machines across a campus.
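
For illustration, DAGMan dependencies are declared in a plain DAG file. A minimal sketch, with hypothetical node names and submit files:

```text
# diamond.dag -- file and node names are hypothetical.
# A and B run first; C starts only after both succeed.
JOB  A  prepare.sub
JOB  B  prepare.sub
JOB  C  analyze.sub
PARENT A B CHILD C
# Retry a flaky node up to 3 times before failing the DAG.
RETRY C 3
```

The DAG is then submitted with `condor_submit_dag diamond.dag`.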

+6

Platform LSF will do everything you need. It runs on Windows. It is commercial and can be purchased with support.

Yes. 1. There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible.

Yes. 2. It is built around the concept of a job. A job can consist of one to 100,000 tasks.

Yes. 3. A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes.

Yes. 4. Tasks can create other tasks on the fly.

Yes. 5. Some tasks complete within minutes, while others may take many hours.

Yes. 6. Tasks are executed according to a dependency hierarchy that can be updated on the fly.

Yes. 7. A job can be suspended and resumed later.

Yes. 8. Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager takes this into account when scheduling tasks.

Yes. 9. Tasks report their progress and results back to the manager.

Yes. 10. The manager knows whether a task is alive or hung.
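
As a sketch of points 6–8 in LSF terms, a submission script might look like this (job names, binaries, and resource amounts are hypothetical placeholders):

```text
# Reserve 2 slots, 4 GB of memory, and 10 GB of /tmp for the task,
# so the scheduler only places it on hosts with enough free resources.
bsub -J render_tile -n 2 -R "rusage[mem=4096:tmp=10240]" ./render_tile.exe

# Run a follow-up task only after render_tile finishes successfully.
bsub -J merge_tiles -w "done(render_tile)" ./merge_tiles.exe

# Suspend and later resume a job by the ID that bsub printed.
bstop 12345
bresume 12345
```

The `rusage` string is how LSF expresses the custom per-task resource requirements from point 8, including disk space, which HPCS cannot model.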

+2

Have you looked at Beowulf? There are many distributions to choose from and plenty of customization options. You should be able to find something that meets your needs...

0
source share

I would recommend Beowulf because a Beowulf cluster behaves more like a single machine than like many separate workstations.

0

Give GridGain a try. It makes it easy to add nodes at runtime, and you can monitor and manage the cluster through its JMX interfaces.

0

If you don't mind hosting your project in the cloud, you could check out Windows Azure / AppFabric. AFAIK it lets you distribute your tasks across workers, and you can dynamically add worker machines to process your tasks as the load increases.

0

You could solve this problem with DataSynapse GridServer.

  • There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible. Yes, a single broker can easily handle 2,000 engines.
  • It is built around the concept of a job. A job can consist of one to 100,000 tasks. Yes, I have queued over 250,000 tasks without problems; eventually you run out of memory.
  • A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes. Yes.
  • Tasks can create other tasks on the fly. This can be done, although I would not recommend such a model.
  • Some tasks complete within minutes, while others may take many hours. Yes.
  • Tasks are executed according to a dependency hierarchy that can be updated on the fly. Yes, but I would handle this outside of the grid computing infrastructure.
  • A job can be suspended and resumed later. Yes.
  • Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager takes this into account when scheduling tasks. Yes.
  • Tasks report their progress and results back to the manager. Yes.
  • The manager knows whether tasks are pending or hung. Yes.

0

Have you looked into Sun Grid Engine? I used it for quite a while, though never to its full capabilities, but this is my understanding:

  • There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible. Yes.
  • It is built around the concept of a job. A job can consist of one to 100,000 tasks. Not sure.
  • A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes. Yes.
  • Tasks can create other tasks on the fly. I think so?
  • Some tasks complete within minutes, while others may take many hours. Yes.
  • Tasks are executed according to a dependency hierarchy that can be updated on the fly. Not sure.
  • A job can be suspended and resumed later. Not sure.
  • Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager takes this into account when scheduling tasks. Definitely.
  • Tasks report their progress and results back to the manager. Definitely.
  • The manager knows whether tasks are pending or hung. Yes.
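
For what it's worth, a many-task job and a dependency might be submitted to SGE roughly like this (script names, job names, and resource amounts are hypothetical placeholders):

```text
# A job array of 1,000 tasks; each instance reads its index from
# the SGE_TASK_ID environment variable and requests 2 GB of free memory.
qsub -N img_tasks -t 1-1000 -l mem_free=2G process_image.sh

# A merge step that is held until the whole img_tasks array finishes.
qsub -hold_jid img_tasks merge.sh
```

Job arrays map well onto the "one job, many tasks" model in the question, and `-l` resource requests cover the per-task resource requirements.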

-1
