Please recommend an alternative to Microsoft HPC

We want to implement a distributed system on a cluster that performs resource-intensive, image-based computations with heavy I/O against large storage volumes. It has the following characteristics:

  1. There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible.
  2. It is built around the concept of a job. A job can consist of one to 100,000 tasks.
  3. A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes.
  4. Tasks can create other tasks on the fly.
  5. Some tasks complete within minutes, while others may take many hours.
  6. Tasks are executed according to a dependency hierarchy that can be updated on the fly.
  7. A job can be paused and resumed later.
  8. Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager must take this into account when scheduling tasks.
  9. Tasks report their progress and results back to the manager.
  10. The manager knows whether tasks are pending or hung.
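
To make requirement 6 concrete: the tasks form a dependency DAG, and a task becomes runnable only once all of its parents have finished. A minimal sketch of that scheduling constraint (not tied to any particular product; names are illustrative only):

```python
from collections import deque

def runnable_order(deps):
    """deps maps task -> set of tasks it depends on; returns a valid run order."""
    indegree = {t: len(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)
    # Tasks with no unfinished parents are immediately runnable.
    ready = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for child in children[t]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

Any scheduler satisfying requirement 6 has to maintain essentially this invariant, even when edges are added to the DAG while jobs are running.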

We found that Windows HPC Server 2008 R2 (HPCS) is conceptually very close to what we need. However, it has several important drawbacks:

  • Job creation becomes exponentially slower as the number of tasks increases. Submitting more than a few thousand tasks takes an unbearable amount of time.
  • A task cannot report its progress to the manager; only the job can.
  • There is no way to communicate with a task while it is running, which makes it impossible to check whether the task is alive or needs to be restarted.
  • HPCS only knows nodes, CPU cores, and memory as resource units. We cannot introduce our own resource units (for example, free disk space, custom hardware devices, etc.).

Here's my question: does anyone know of and/or have experience with a distributed computing framework that could help us? We are using Windows.

+6
windows cluster-computing hpc distributed-computing
8 answers

I would look at the Condor high-throughput computing project. It supports Windows clients and servers (as well as Linux and OS X), handles complex dependencies between tasks using DAGMan, and can pause (and even migrate) jobs. I have experience with Condor-based systems that scale to thousands of machines across a campus.
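
For illustration, DAGMan dependencies are declared in a plain DAG file. A minimal sketch, with hypothetical node names and submit files:

```text
# diamond.dag -- file and node names are hypothetical.
# A and B run first; C starts only after both succeed.
JOB  A  prepare.sub
JOB  B  prepare.sub
JOB  C  analyze.sub
PARENT A B CHILD C
# Retry a flaky node up to 3 times before failing the DAG.
RETRY C 3
```

The DAG is then submitted with `condor_submit_dag diamond.dag`.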

+6

Platform LSF will do everything you need. It runs on Windows. It is commercial and can be purchased with support.

Yes. 1. There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible.

Yes. 2. It is built around the concept of a job. A job can consist of one to 100,000 tasks.

Yes. 3. A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes.

Yes. 4. Tasks can create other tasks on the fly.

Yes. 5. Some tasks complete within minutes, while others may take many hours.

Yes. 6. Tasks are executed according to a dependency hierarchy that can be updated on the fly.

Yes. 7. A job can be suspended and resumed later.

Yes. 8. Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager takes this into account when scheduling tasks.

Yes. 9. Tasks report their progress and results back to the manager.

Yes. 10. The manager knows whether a task is alive or hung.
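
As a sketch of points 6–8 in LSF terms, a submission script might look like this (job names, binaries, and resource amounts are hypothetical placeholders):

```text
# Reserve 2 slots, 4 GB of memory, and 10 GB of /tmp for the task,
# so the scheduler only places it on hosts with enough free resources.
bsub -J render_tile -n 2 -R "rusage[mem=4096:tmp=10240]" ./render_tile.exe

# Run a follow-up task only after render_tile finishes successfully.
bsub -J merge_tiles -w "done(render_tile)" ./merge_tiles.exe

# Suspend and later resume a job by the ID that bsub printed.
bstop 12345
bresume 12345
```

The `rusage` string is how LSF expresses the custom per-task resource requirements from point 8, including disk space, which HPCS cannot model.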

+2

Have you looked at Beowulf? There are many distributions to choose from and plenty of customization options. You should be able to find something that meets your needs...

0
source share

I would recommend Beowulf because a Beowulf cluster behaves more like a single machine than like many separate workstations.

0

Give GridGain a try. It makes it easy to add nodes at runtime, and you can monitor and manage the cluster through its JMX interfaces.

0

If you don't mind hosting your project in the cloud, you could check out Windows Azure / AppFabric. AFAIK it lets you distribute your tasks across workers, and you can dynamically add worker machines to process your tasks as the load increases.

0

You could solve this problem with DataSynapse GridServer.

  • There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible. Yes, a single broker can easily handle 2,000 engines.
  • It is built around the concept of a job. A job can consist of one to 100,000 tasks. Yes, I have queued over 250,000 tasks without problems; eventually you run out of memory.
  • A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes. Yes.
  • Tasks can create other tasks on the fly. This can be done, although I would not recommend such a model.
  • Some tasks complete within minutes, while others may take many hours. Yes.
  • Tasks are executed according to a dependency hierarchy that can be updated on the fly. Yes, but I would handle this outside of the grid computing infrastructure.
  • A job can be suspended and resumed later. Yes.
  • Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager takes this into account when scheduling tasks. Yes.
  • Tasks report their progress and results back to the manager. Yes.
  • The manager knows whether tasks are pending or hung. Yes.

0

Have you looked into Sun Grid Engine? I used it for quite a while, though never to its full capabilities, but this is my understanding:

  • There is a dedicated manager node and up to 100 compute nodes. The cluster must be easily extensible. Yes.
  • It is built around the concept of a job. A job can consist of one to 100,000 tasks. Not sure.
  • A job submitted by the user on the manager node leads to the creation of tasks on the compute nodes. Yes.
  • Tasks can create other tasks on the fly. I think so?
  • Some tasks complete within minutes, while others may take many hours. Yes.
  • Tasks are executed according to a dependency hierarchy that can be updated on the fly. Not sure.
  • A job can be suspended and resumed later. Not sure.
  • Each task requires certain resources in terms of CPU (cores), memory, and local disk space. The manager takes this into account when scheduling tasks. Definitely.
  • Tasks report their progress and results back to the manager. Definitely.
  • The manager knows whether tasks are pending or hung. Yes.
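
For what it's worth, a many-task job and a dependency might be submitted to SGE roughly like this (script names, job names, and resource amounts are hypothetical placeholders):

```text
# A job array of 1,000 tasks; each instance reads its index from
# the SGE_TASK_ID environment variable and requests 2 GB of free memory.
qsub -N img_tasks -t 1-1000 -l mem_free=2G process_image.sh

# A merge step that is held until the whole img_tasks array finishes.
qsub -hold_jid img_tasks merge.sh
```

Job arrays map well onto the "one job, many tasks" model in the question, and `-l` resource requests cover the per-task resource requirements.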

-1
