How do I build a system of 1,500 servers that delivers results instantly?

I want to build a system that delivers a user-interface response within 100 ms, but whose calculation takes minutes of CPU time. Fortunately, I can divide the work into very small parts, so it can be distributed across many servers, say 1,500 of them. The request is delivered to one of them, which redistributes it to 10-100 other servers, which redistribute it further, etc.; after the calculations are done, the results are gathered back up and returned by a single server. In other words, something similar to Google Search.
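To make the intended scatter/gather shape concrete, here is a minimal Python sketch that simulates it with local function calls instead of real servers; compute_part, N_SERVERS and the thread pool are illustrative placeholders, not any particular platform's API.

from concurrent.futures import ThreadPoolExecutor

N_SERVERS = 1500                      # one small task per (simulated) server

def compute_part(part):
    # Stand-in for the ~50 ms of real work one leaf server would do.
    return sum(part)

def scatter_gather(parts):
    # Scatter: one task per server; gather: combine the partial results.
    with ThreadPoolExecutor(max_workers=64) as pool:
        return sum(pool.map(compute_part, parts))

if __name__ == "__main__":
    work = [list(range(i, i + 100)) for i in range(N_SERVERS)]
    print(scatter_gather(work))

In the real system the scatter step would itself be hierarchical (each node forwarding to 10-100 others, as described above), and the calls would go over the network rather than to a local thread pool.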

The problem is: which technology should I use? Cloud computing seems like the obvious answer, but the 1,500 servers need to be ready to do their jobs, with the data for their specific tasks already available. Can this be done with any of the existing cloud computing platforms, or should I create 1,500 separate cloud computing applications and deploy them all?

Edit: Dedicated physical servers do not make sense, because the average load will be very, very small. For the same reason it does not make sense for us to run the servers ourselves - they need to be shared servers at an external provider of some kind.

Edit2: I basically want to buy 30 minutes of CPU time, and I am willing to spend up to $3,000 on it, which is equivalent to $144,000 per CPU-day. The only criterion is that those 30 CPU-minutes are spread across 1,500 responsive servers.

Edit3: I expect the solution to be something like "Use Google Apps, create 1,500 applications and deploy them", or "Contact XYZ, write an asp.net script that their service can deploy, and pay them based on the amount of CPU time you use", or something along those lines.

Edit4: A low-cost web hosting provider offering asp.net for $1/month could actually solve the problem (!) - I could create 1,500 accounts, the latency is fine (I checked), and everything would work - except that I need those 1,500 accounts to be on different servers, and I do not know of a single provider with enough servers to spread my accounts across that many machines. I fully understand that latency will vary from server to server, and that some may be unreliable - but that can be handled in software by retrying on different servers.

Edit5: I just tried it, benchmarking a low-cost web hosting provider at $1/month. They can perform the node calculation and deliver the result to my laptop in 15 ms, provided the node is preloaded. Preloading can be done by submitting a request shortly before the real one. If a node does not respond within 15 ms, its part of the task can be redistributed to several other servers, one of which will most likely respond within 15 ms. Unfortunately, they do not have 1,500 servers, which is why I am asking here.
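A minimal sketch of that retry idea (send the task to one node; if it has not answered within 15 ms, fire the same task at a few backup nodes and take whichever answer arrives first). call_node and the node objects are hypothetical placeholders for whatever HTTP/RPC call the hosting accounts would actually expose.

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

DEADLINE_S = 0.015                 # the 15 ms observed for a preloaded node

def call_node(node, task):
    # Placeholder for the real request to one shared-hosting account;
    # a warm-up call shortly beforehand keeps the node preloaded.
    return node.compute(task)

def hedged_call(primary, backups, task, pool):
    futures = [pool.submit(call_node, primary, task)]
    done, _ = wait(futures, timeout=DEADLINE_S, return_when=FIRST_COMPLETED)
    if not done:
        # Primary missed the deadline: hedge the request to the backups.
        futures += [pool.submit(call_node, b, task) for b in backups]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    return next(iter(done)).result()

# Usage (the pool is shared across requests so its threads stay warm):
# pool = ThreadPoolExecutor(max_workers=16)
# result = hedged_call(node_a, [node_b, node_c], task, pool)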

+4
13 answers

[Apologies in advance to the group for using part of the answer space for meta-level remarks]

From OP, Lars D:
I do not consider this answer to be an answer to the question, because it does not bring me any closer to a solution. I know what cloud computing is, and I know the algorithm could, if necessary, be split perfectly across more than 300,000 servers, although the extra cost would bring no extra performance because of network latency.

Lars,
I sincerely apologize for reading and answering your question at a naive and overly general level. I hope you can see how the lack of specificity in the question itself, especially in its original form, as well as the somewhat unusual nature of the problem (1), prompted me to answer it that way. That, and the fact that questions of this kind on SO tend to come from hypotheses put forward by people who have given the process little thought or research, is my excuse for believing that I, a relative layman in massively distributed systems, could help your quest. The many answers (several of which benefited from the additional information you provided), as well as the numerous comments and follow-up questions addressed to you, suggest that I was not alone in thinking this.

(1) The non-obvious problem: [apparently] a mainly computational process (no mention of distributed/replicated storage structures), very highly parallelizable (1,500 servers), split into fifty-millisecond-sized tasks that together produce a sub-second answer (for human consumption?). And yet a process that will only be needed a few times [per day..?].

Looking ahead!
In practical terms, you might consider some of the following to improve this SO question (or to spin it off into other/alternative questions) and thereby make it easier for domain experts to help.

  • Re-post it as a separate (more specific) question, or in fact perhaps several questions: e.g. on the [likely] latency and/or overhead weaknesses of the redistribution process, on current pricing (for specific terms of service and volumes), on awareness of distributed offerings from different vendors, etc.
  • Change the title.
  • Add information about the process you have in mind (see the many questions in the comments on both the question and the various answers).
  • For some of the questions, add tags specific to a vendor or delivery method (EC2, Azure ...), as this may draw comments from agents of those companies that are not entirely unbiased, but useful nevertheless.
  • Show that you understand that your quest is a rather tall order.
  • State explicitly that you want feedback from people who have actually implemented the underlying technologies (perhaps also including people who are just getting their feet wet with these technologies, because, with the exception of people in physics/high-energy research etc., who BTW have traditionally worked with clusters rather than clouds, many of the technologies and practices are relatively new).

In addition, I will happily take the hint (with an implicit non-veto from the other people on this page) and delete my answer if you find that doing so would help improve the answers.

- original answer -

Warning: not every process or mathematical calculation can easily be divided into independent parts that can then be executed in parallel ...

Perhaps you could check the Wikipedia entry for Cloud Computing, bearing in mind that cloud computing is not the only architecture that allows parallel computing.

If your process/calculation can be divided effectively into parallelizable parts, you might look at Hadoop, or other MapReduce implementations, for a general understanding of this kind of parallel processing. In addition (and, I believe, using the same or similar algorithms), there are also commercial offerings such as Amazon's EC2.

Beware, however, that the systems mentioned above are not particularly well suited to very fast response times. They work better on hour-long (and then some) data/number-crunching jobs, rather than on small calculations like the one you want to parallelize so that it delivers results in 1/10 of a second.

The frameworks above are generic, in the sense that they can run processes of almost any nature (again, processes that can be at least partially partitioned), but there are also various offerings for specific applications such as search or DNA matching, etc. Search applications in particular can have very short response times (Google, for example), and BTW this is partly because such tasks can be farmed out for parallel processing very easily and quickly.

+8

Sorry, but you are expecting too much.

The problem is that you expect to pay only for processing power. But your main limitation is latency, and you expect that to come for free. It does not work that way. You need to figure out what your latency budget is.

Just combining the data from multiple compute servers will take several milliseconds per level. There will be a Gaussian distribution of response times, so with 1,500 servers the slowest server will only respond about 3σ after the mean. And since a hierarchy will be needed, there is a second level with 40 servers, where again you wait for the slowest one.
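To see that "slowest of 1,500" effect, here is a small simulation sketch; the 10 ms mean and 2 ms standard deviation are made-up illustrative numbers.

import random

N, MEAN_MS, SIGMA_MS, TRIALS = 1500, 10.0, 2.0, 200

# The per-level latency is the slowest of N roughly Gaussian response times.
worst = [max(random.gauss(MEAN_MS, SIGMA_MS) for _ in range(N))
         for _ in range(TRIALS)]
print(sum(worst) / TRIALS)   # typically around MEAN_MS + 3.2 * SIGMA_MS, i.e. ~16 ms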

Internet hops also add up quickly; they alone could take 20 to 30 ms out of your latency budget.

Another consideration is that these hypothetical servers will spend most of their time idle. That means they are plugged in but not generating income. Any party with that many idle servers will turn them off, or at least put them to sleep, to save energy.

+5

MapReduce is not a solution! MapReduce is used by Google, Yahoo and Microsoft to build indexes out of the huge data (the entire Web!) they have on disk. The task is huge, and MapReduce was built to make it happen in hours instead of years, but starting MapReduce's master controller alone already takes 2 seconds, so for your 100 ms it is not an option.

What you could take from Hadoop is the distributed file system. It may let you place tasks close to where the data physically lives, but that is it. BTW: setting up and managing a Hadoop distributed file system means controlling your 1,500 servers!

Frankly, within your budget I do not see any "cloud" service that will let you rent 1,500 servers. The only viable option is renting time on a grid computing offering such as those from Sun and IBM, but as far as I know they want you to commit to CPU hours, not just 30 minutes.

BTW: on Amazon EC2 you can get a new server up in a couple of minutes, but you then have to keep it for at least an hour!

Hope you find a solution!

+2

I don't understand why you want to do this, if it is only because "our user interfaces generally aim to complete all actions in less than 100 ms, and this calculation should meet that criterion too."

First, "aim for" != "need"; it is a guideline - why build this massive apparatus just because of it? Consider: 1,500 servers x 100 ms = 150 seconds = 2.5 minutes of total compute. Reducing 2.5 minutes to a few seconds is a much healthier goal. There is a place for a "we are processing your request" message along with an animation.

So my answer is: post a modified version of the question with reasonable goals - a few seconds, 30-50 servers. I don't have an answer for that question either, but the question as posted here seems to be the wrong one. It might even be doable with 6-8 multi-processor servers.

+2

Google does it with a giant farm of small, networked Linux servers. They use a flavor of Linux that they have modified for their search algorithms. The costs are software development and cheap PCs.

+1

It seems you really are expecting at least a 1,000-fold speedup from distributing your job across a number of computers. That may be fine. Your latency requirement, however, looks hard.

Have you considered the latencies inherent in farming the job out? Essentially, the computers would have to be fairly close to one another just to stay within the speed of light. In addition, the data center housing the machines would have to be close enough to your client that you can get your request to them and back in under 100 ms. On the same continent, at least.

Also note that any extra latency requires you to have many more nodes in the system. Losing 50% of the available computing time to latency, or to anything else that does not parallelize, forces you to double the computing capacity of the parallel parts just to keep up.
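A rough worked example of that doubling effect, using the question's own numbers (about 150 CPU-seconds of total work, a 100 ms response budget); it ignores fan-out overhead and is only meant to show the shape of the trade-off.

def nodes_needed(total_work_s, budget_s, overhead_s):
    # Only (budget - overhead) of wall-clock time is left for each node's share.
    return total_work_s / (budget_s - overhead_s)

print(nodes_needed(150.0, 0.100, 0.000))   # 1500.0 nodes with zero overhead
print(nodes_needed(150.0, 0.100, 0.050))   # 3000.0 nodes if 50 ms is lost to latency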

I doubt that a cloud computing system is the best fit for a problem like this. My impression, at least, is that cloud computing proponents prefer not even to tell you where your machines are. I certainly have not seen any latency terms in the SLAs that are available.

+1

You have conflicting requirements. Your 100 ms latency requirement directly contradicts your desire to run your program only occasionally.

One characteristic of the Google-style approach you mention is that the latency of the cluster is governed by its slowest node. So you could have 1,499 machines respond in under 100 ms, but if one machine takes longer - say 1 s, because of a retry, or because it had to page your application in, or because of poor communication - your whole cluster takes 1 s to produce an answer. That is unavoidable with this approach.

The only way to achieve the latencies you are after is to make sure every machine in your cluster keeps your program in RAM - along with all the data it needs - all of the time. Loading your program from disk, or even paging it in from disk, takes well over 100 ms. The moment one of your servers has to touch the disk, it is game over for your 100 ms latency requirement.

In the shared-server environment we are talking about here, given your cost constraints, it is almost certain that at least one of your 1,500 servers will have to go to disk to activate your app.

So you will either have to pay enough to convince someone to keep your program running and in memory at all times, or you will have to relax your latency requirements.

+1

Two trains of thought:

a) If these constraints really, absolutely are grounded in common sense, and are implemented the way you propose in the nth edit, it seems the data involved is not huge. So how about trading storage for computation time? How big are the table(s)? Terabytes are cheap!

b) This sounds very much like an employer/customer request that is not grounded in common sense. (In my experience.)

Suppose the calculation takes 15 minutes on a single core; I think that is what you are saying. For a reasonable amount of money you can buy a system with 16 physical (32 hyperthreaded) cores and 48 GB of RAM.

That should get us into the 30-second range. Add a dozen terabytes of memory and some preprocessing, and perhaps you gain another factor of 10: 3 s. Is 3 s too slow? If so, why?
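The back-of-the-envelope arithmetic behind those estimates (the 32-way scaling and the factor of 10 from preprocessing are this answer's own optimistic assumptions):

single_core_s = 15 * 60                 # 15 CPU-minutes of work
threads = 32                            # 16 physical cores, hyperthreaded
print(single_core_s / threads)          # ~28 s: the "30 second range"
print(single_core_s / threads / 10)     # ~2.8 s with the assumed 10x preprocessing win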

+1

It sounds like you need an approach like MapReduce ("MapReduce: Simplified Data Processing on Large Clusters").

Wiki

0

Check out Parallel Computing and the related articles in that Wikipedia entry. "Parallel programming languages, libraries, APIs, and parallel programming models have been created for programming parallel computers" .... http://en.wikipedia.org/wiki/Parallel_computing

0

You will find a lot of material on problems like this at

http://highscalability.com/

0

Although cloud computing is the cool new kid in town, your scenario sounds more like a cluster job, i.e. "how can I use parallelism to solve a problem in less time". My approach would be:

  • Understand that even if you have a problem that can be solved in n time steps on one CPU, that does not guarantee it can be solved in n/m steps on m CPUs. In fact, n/m is the theoretical lower bound; parallelism usually forces you to communicate more, so you are unlikely ever to reach that bound (see the sketch after this list).
  • Parallelize your serial algorithm, and make sure it is still correct and you have not introduced any race conditions.
  • Find a provider and see what it can offer you in terms of programming languages/APIs (I have no experience with this).
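A small illustration of why the n/m bound is rarely reached, assuming (purely for illustration) a fixed serial fraction plus a per-node communication cost; the numbers are made up.

def speedup(m, serial_frac=0.05, comm_cost_per_node=0.001):
    # Time on m nodes = serial part + parallel part / m + communication cost
    # that grows with the number of nodes (all as fractions of one-node time).
    time_on_m = serial_frac + (1.0 - serial_frac) / m + comm_cost_per_node * m
    return 1.0 / time_on_m

for m in (10, 100, 1500):
    print(m, round(speedup(m), 1))
# With these made-up numbers the speedup tops out far below 1500x.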
0

What you are asking for does not exist, for the simple reason that you would need 1,500 instances of your application (probably with significant data in memory) sitting idle on 1,500 machines - consuming resources on all of them. None of the existing cloud computing offerings bill on that basis. Platforms such as App Engine and Azure do not give you direct control over how your application is distributed, while platforms like Amazon EC2, which charge per instance-hour, would cost you more than $2,000 a day.
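For a rough sense of scale, assuming an on-demand price of roughly $0.10 per small instance-hour (the ballpark at the time this was written; check current pricing):

instances, price_per_hour, hours_per_day = 1500, 0.10, 24
print(instances * price_per_hour * hours_per_day)   # about $3,600 per day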

0
