What are common practices for deploying large-scale systems?

Given a large-scale software project with several components written in different languages, configuration files, configuration scripts, environment settings, and database migration scripts - what are the general deployment methods for production?

What are the difficulties? Can I simplify the process with tools like Ant or Maven? How should I handle rollbacks and database changes? Is version control appropriate for the production environment?

+4
3 answers

As I see it, you're mostly asking about best practices and tools for release engineering, AKA "releng" - it's worth learning the term of art for the subject, since it makes it much easier to search for further information.

A configuration management system (CMS, i.e. a version control system) is a must for software development today; if you use one or more IDEs, it is also nice to have good integration between them and the CMS, though that is more of an issue for development than for deployment / releng.

From the releng point of view, the key feature of a CMS is good support for "branching" (under whatever name), since releases should be made from a "release branch" on which all the code being shipped, and all its dependencies (code and data), form a stable "snapshot" from which the exact same configuration can be reproduced at will.

The need for good branching support may be more obvious if you have to maintain multiple branches (customised for different uses, platforms, and so on), but even if your releases always follow a single, strictly linear sequence, releng best practice still dictates making a release branch. "Good branching support" includes easy merging (with "conflict resolution" when different changes have been made to the same file), "cherry picking" (taking one patch or changeset from one branch, or from the head/trunk, and applying it to another branch), and so forth.
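
As a concrete illustration of that cherry-picking workflow, here is a minimal sketch driving git from Python; the branch names and the commit hash are hypothetical, and git itself is just one example of a CMS with good branching support.

```python
# Sketch: create a release branch and cherry-pick a single fix onto it.
# git is used as an example VCS; branch names and the commit hash are hypothetical.
import subprocess

def git(*args: str) -> None:
    """Run a git command and fail loudly if it does not succeed."""
    subprocess.run(["git", *args], check=True)

# Cut the release branch from the current trunk snapshot.
git("checkout", "-b", "release-1.4", "main")

# Later, QA finds a regression that was fixed on trunk in commit abc1234;
# "cherry picking" applies just that changeset to the release branch.
git("checkout", "release-1.4")
git("cherry-pick", "abc1234")
```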

In practice, you begin the release process by creating the release branch; then you run exhaustive testing on that branch (usually much more than what you run every day in your continuous build, including extensive regression testing, integration testing, load testing, performance testing, etc., and possibly even more costly quality-assurance processes, depending on your needs). If and when that testing and QA identify defects in the release candidate (including regressions, performance degradation, etc.), they must be fixed; in a large team, development on the head/trunk can continue while QA is in progress, which is exactly why you need easy cherry-picking / merging / etc. (whether your practice is to make the fixes on the head or on the release branch, they still need to be merged to the other side eventually ;-).
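
The "exhaustive testing on the release branch" gate can itself be a small automated script. A minimal sketch follows; the suite names and commands are placeholders for whatever your project actually runs.

```python
# Sketch of a release-candidate test gate: run every suite, reject the
# candidate on the first failure. Suite names and commands are placeholders.
import subprocess
import sys

TEST_SUITES = {
    "regression":  ["pytest", "tests/regression"],
    "integration": ["pytest", "tests/integration"],
    "performance": ["python", "perf/run_benchmarks.py"],
}

def qualify_release_candidate() -> bool:
    for name, command in TEST_SUITES.items():
        print(f"running {name} suite ...")
        if subprocess.run(command).returncode != 0:
            print(f"release candidate rejected: {name} suite failed")
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if qualify_release_candidate() else 1)
```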

Last but not least, you DO NOT get the full releng value out of your CMS unless you somehow also track with it "everything" your releases depend on - the simplest way would be to keep copies of, or hard links to, all the binaries of the tools needed to build the release and so on, but that is often impractical; so at the very least record the exact release, version and bugfix numbers of every tool used (operating system, compilers, system libraries, tools that pre-process image, sound or video files into their final form, etc., etc.). The key is being able, if the need arises, to reproduce exactly the environment required to rebuild the precise version proposed for release (otherwise you will go crazy chasing subtle bugs that depend on third-party tools having changed behaviour between versions ;-).
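
One low-tech way to do that tracking, sketched below under the assumption that the relevant tools are on the build machine's PATH: capture their version strings into a small manifest that gets committed on the release branch. The tool list and file name are only examples.

```python
# Sketch: record the exact toolchain used for a build so the release
# environment can be reproduced later. The tool list and file name are examples.
import json
import platform
import subprocess

def tool_version(command: list[str]) -> str:
    """Capture the first line of a tool's --version output."""
    out = subprocess.run(command, capture_output=True, text=True, check=True)
    return out.stdout.splitlines()[0]

manifest = {
    "os": platform.platform(),
    "python": platform.python_version(),
    "gcc": tool_version(["gcc", "--version"]),   # assumes gcc is installed
    "git": tool_version(["git", "--version"]),   # assumes git is installed
}

# Commit this file on the release branch, next to the code it describes.
with open("build-environment.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```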

After the CMS, the second most important tool for releng is a good issue tracking system - ideally one that integrates well with the CMS. It also matters for the development process (and other aspects of product management), but from the release-process point of view the issue tracker's importance is the ability to easily document exactly which bugs have been fixed, which features have been added, removed or changed, and what changes in performance (or other user-observable characteristics) are expected in the forthcoming release. For that purpose, a key development "best practice" is that every changeset committed to the CMS must be connected to one (or more) issues in the issue tracking system: after all, the change must have some purpose (fix a bug, change a feature, optimise something, or perform some internal refactoring that is supposed to be invisible to the software's users). Likewise, every tracked issue marked as "closed" must be connected to one (or more) changesets (unless the closure is of the "won't fix / works as intended" kind); issues concerning bugs in third-party components that were fixed by the third-party supplier are easy to treat the same way if you manage to track all third-party components in the CMS too, see above; if you don't, there should at least be text files under CMS control documenting the third-party components and their evolution, again see above, and those files need to change whenever a tracked issue on a third-party component is closed.
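
That "every changeset names an issue" rule is easy to enforce mechanically. Below is a minimal sketch of a commit-message check (usable, for example, as a git commit-msg hook); the ISSUE-123 key pattern is an assumption to adapt to your own tracker.

```python
# Sketch of a commit-message check enforcing "every changeset names an issue".
# The ISSUE-123 style pattern is an example; adapt it to your tracker's keys.
import re
import sys

ISSUE_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")   # e.g. RELENG-42, BUG-1337

def check_commit_message(path: str) -> int:
    with open(path) as fh:
        message = fh.read()
    if ISSUE_PATTERN.search(message):
        return 0
    print("rejected: commit message must reference an issue (e.g. BUG-1337)")
    return 1

if __name__ == "__main__":
    # git passes the path of the message file to the commit-msg hook.
    sys.exit(check_commit_message(sys.argv[1]))
```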

Automating the various releng processes (including building, automated testing, and deployment tasks) is the third top priority - automated processes are much more productive and repeatable than asking some poor individual to manually work through a list of steps (for sufficiently complex tasks, of course, the automation workflow may need to "keep a human in the loop"). As you surmise, tools such as Ant (and SCons, etc., etc.) can help here, but inevitably (unless you are lucky enough to get away with very simple and straightforward processes) you will find yourself enriching them with ad-hoc scripts and the like (some powerful, flexible scripting language such as perl, python or ruby will help). A "workflow engine" can also be precious when your release workflow is sufficiently complex (for example, involving specific people or groups of people "signing off" on QA compliance, legal compliance, user-interface-guidelines compliance, and so forth).
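
A sketch of the kind of ad-hoc glue script that paragraph has in mind: chaining the build, test and deploy steps so nobody has to run them from a checklist. The individual commands (the ant targets, the deploy script) are placeholders.

```python
# Sketch: a tiny pipeline runner chaining build, test and deploy steps.
# Each command is a placeholder for your project's real tooling.
import subprocess
import sys

PIPELINE = [
    ("build",  ["ant", "package"]),
    ("test",   ["ant", "test"]),
    ("deploy", ["python", "deploy.py", "--target", "staging"]),
]

def run_pipeline() -> int:
    for step, command in PIPELINE:
        print(f"== {step} ==")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"pipeline stopped: '{step}' failed")
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_pipeline())
```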

Some of the other specific issues you ask about vary enormously depending on the specifics of your environment. If you can afford scheduled downtime, your life is relatively easy, even with a large database in play, since you can operate sequentially and deterministically: you shut the existing system down gracefully, ensure the current database is saved and backed up (easing rollback, in the hopefully rare case it is needed), run the one-off scripts for schema migration or other "irreversible" environment changes, bring the system back up in a mode that is still inaccessible to ordinary users, run one more extensive suite of automated tests - and finally, if everything has gone smoothly (including saving and backing up the database in its new state, if relevant), the system is opened up to general use again.
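
Expressed as a script skeleton, that scheduled-downtime sequence might look like the sketch below; every command here is a placeholder for whatever your own stack uses (the service names, the PostgreSQL dump and the migration script are all assumptions).

```python
# Sketch of a scheduled-downtime rollout. All commands are placeholders.
import subprocess

def step(description: str, command: list[str]) -> None:
    print(f"--> {description}")
    subprocess.run(command, check=True)   # abort the rollout on any failure

step("stop application gracefully",       ["systemctl", "stop", "myapp"])
step("back up database before changes",   ["pg_dump", "-f", "pre_release.sql", "mydb"])
step("run one-off schema migrations",     ["python", "migrate.py", "--apply-all"])
step("restart app in maintenance mode",   ["systemctl", "start", "myapp-maintenance"])
step("run post-upgrade automated tests",  ["pytest", "tests/smoke"])
step("back up database in its new state", ["pg_dump", "-f", "post_release.sql", "mydb"])
step("open the system to users",          ["systemctl", "start", "myapp"])
```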

If you need to update a live system without downtime, that can range from a minor inconvenience to a systematic nightmare. In the best case, transactions are quite short, synchronisation of the state they set can be delayed a little without harm, and you have plenty of slack in system resources (CPU, storage, etc.). In that case you run two systems in parallel - the old one and the new one - and simply make sure that all new transactions are directed to the new system, while letting transactions already in flight complete on the old one. A separate task periodically syncs "new data in the old system" over to the new system as transactions on the old system finish. Eventually you can determine that no transactions are running on the old system and that all the changes which happened there have been synced to the new one - and at that point you can finally shut the old system down. (You should also be prepared to "reverse sync", of course, in case you need to roll the change back.)
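
A toy sketch of that cut-over idea, just to make the moving parts explicit; the System class is a stand-in for whatever actually serves your traffic, and the periodic sync job is omitted.

```python
# Sketch: run old and new systems in parallel; new transactions go to the new
# system, the old one only finishes what it already started.
class System:
    def __init__(self, name: str):
        self.name = name
        self.open_transactions = 0

    def begin(self, txn_id: str) -> None:
        self.open_transactions += 1
        print(f"{self.name}: started {txn_id}")

    def finish(self, txn_id: str) -> None:
        self.open_transactions -= 1
        print(f"{self.name}: finished {txn_id}")

old, new = System("old"), System("new")

def route_new_transaction(txn_id: str) -> System:
    """After cut-over, every *new* transaction lands on the new system."""
    new.begin(txn_id)
    return new

def old_system_drained() -> bool:
    """The old system can be shut down once nothing is still running on it
    (and once the periodic sync job has copied its final state across)."""
    return old.open_transactions == 0

# Example: a transaction arriving after cut-over goes to the new system.
route_new_transaction("txn-001")
print("old system drained:", old_system_drained())
```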

That is the "simple, sweet" extreme of live-system updating; at the other extreme, you may find yourself in such an over-constrained situation that you can prove the task is impossible (you simply cannot, logically, meet all the stated requirements with the given resources). Long-lived sessions open on the old system that just cannot be interrupted, scarce resources that make it impossible to run two systems in parallel, hard requirements for real-time synchronisation of every single transaction, and so on - life can be miserable (and, as I noted, in the extreme the stated task can be outright impossible). The two best things you can do about it are: (1) make sure you have abundant resources (this will also save your skin when some server unexpectedly goes belly-up... you will have another one to fire up to meet the emergency ;-); (2) consider this predicament from the very start, when initially defining the architecture of the overall system (for example, "prefer short-lived transactions to long-lived sessions that cannot be snapshotted, shut down, and easily reloaded from the snapshot" is one good architectural pointer ;-).

+10

Disclaimer: where I work, we use a tool I wrote myself, which I mention below.

I will tell you how I do it.

For configuration and general deployment of code and content, I use a combination of NAnt, a CI server, and dashy (an automated deployment tool). You can replace dashy with any other "thing" that automates the upload to your servers (capistrano, perhaps).

For database scripts, we use RedGate SQL Compare to generate the scripts, and then for most changes I actually apply them manually where necessary. This is because some changes are a bit too complex, and I feel safer doing them by hand. You really can use the tool to do it for you, though (or at least to generate the scripts).

Depending on your language, there are tools that can script the DB update for you (I think someone on this forum wrote one - hopefully they will answer), but I have no experience with them. It is something I would like to add, though.
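
For what it's worth, the core of such a tool can be quite small. Here is a minimal sketch of a migration runner that applies SQL scripts in order and remembers which ones already ran; it uses sqlite3 purely as an example backend, so the database driver, paths and table name are all assumptions.

```python
# Sketch: apply SQL migration scripts in order, recording which ones have run.
# sqlite3 is used only as an example backend; swap in your real DB driver.
import sqlite3
from pathlib import Path

def apply_migrations(db_path: str, migrations_dir: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}

    for script in sorted(Path(migrations_dir).glob("*.sql")):
        if script.name in applied:
            continue                      # already ran on a previous deploy
        print(f"applying {script.name}")
        conn.executescript(script.read_text())
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (script.name,))
        conn.commit()
    conn.close()

if __name__ == "__main__":
    apply_migrations("app.db", "migrations")
```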

Difficulties

I forgot to answer one of your questions.

The biggest problem when updating any non-trivially complex / distributed site is database synchronisation. You need to think about whether you will have downtime, and if so, what will happen to the database. Do you stop everything so that no transactions can be processed? Or do you shift everything onto one server, update DB B, then synchronise DB A and DB B, and then update DB A? Or something else?

Whatever you choose, you need to pick it and say "OK, for every update there will be X of downtime", or whatever it turns out to be. Just have it written down.

The worst thing you can do is kill someone's transaction mid-processing while you upgrade that server, or somehow leave only part of your system upgraded.

+3

I don't think you really have a choice about whether to use version control.

You cannot do without versioning (as pointed out in the comments).

I speak from experience since I am currently working on a website where we have several elements that should work together:

  • The software itself and its functionality at a given point in time
  • The external systems (6 of them) it interfaces with (the messages are versioned)
  • A database that contains the configuration (and translations)
  • Static resources (images, css, javascript) hosted on an Apache server (several, actually)
  • A Java applet that has to stay in sync with the javascript and the software

This works because we version everything and because the deployment is automated, although I must admit that our database is quite simple.

Versioning means that at any given time we have several parallel versions of the messages, the database, the static resources and the Java applet.

This is especially important in the case of a "rollback". If you find a defect when rolling out new software and suddenly cannot afford to keep it deployed, you have a crisis on your hands if you have not versioned everything, whereas you simply roll back to the old software if you have.

Now, as I already said, the deployment is scripted:
  • static resources are deployed first (plus the Java applet)
  • the database comes next; several configurations cohabit because they are versioned
  • the software itself is rolled out last, in a "quiet" time window (when our traffic is at its lowest point, so at night)
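
A sketch of that deployment order as a script, under the assumption of a 02:00-05:00 quiet window; the commands, hostnames and version number are illustrative only.

```python
# Sketch of the deployment order: static resources first, then the versioned
# database configuration, then the software itself in the quiet window.
import datetime
import subprocess
import time

def in_quiet_window(start_hour: int = 2, end_hour: int = 5) -> bool:
    """True during the low-traffic window (here assumed to be 02:00-05:00)."""
    return start_hour <= datetime.datetime.now().hour < end_hour

def run(command: list[str]) -> None:
    subprocess.run(command, check=True)

# 1. Static resources and the applet can go out ahead of time.
run(["rsync", "-a", "static/", "webserver:/var/www/static/v42/"])

# 2. Database configuration is deployed under a new version id so that the
#    old and new configurations cohabit until the switch.
run(["python", "load_config.py", "--version", "42"])

# 3. The software itself waits for the quiet window.
while not in_quiet_window():
    time.sleep(300)
run(["python", "deploy_app.py", "--version", "42"])
```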

And of course we take care of backward-compatibility issues with the external servers, which should not be underestimated.

+1
