Minimizing Azure downtime

Today we are having a very serious unplanned outage of our Azure application, which has now lasted close to 9 hours. We have reported it to Azure support, and I have no doubt the ops team is actively working on the problem. We managed to get our application running on another "test" hosted service and repointed our CNAME at that instance to keep our customers happy, but the "main" hosted service is still unavailable.

My own "finger in the air" instinct is that the problem is with the network in our data center (Western Europe), and, indeed, later in the day an advisory to that effect appeared on the service dashboard for this region. (Our application shows as "Healthy" in the portal, but it is not reachable through our cloudapp.net URL. In addition, threads in our application are logging SQL connection exceptions to our storage account because they cannot reach the database.)

However, it is very strange that the "test" instance I mentioned above is hosted in the same data center, yet it has no problem contacting the database and its external endpoint is fully reachable.

I would like to ask the community whether there is anything I could have done better to avoid this downtime. I had already made the case to management for running at least 2 instances per role, but I still got burned. Should I move to a more reliable data center? Should I deploy my application to multiple data centers? How can I make sure that my SQL Azure database is in the same data center as my application?

Any constructive guidance would be appreciated - as a techie, I have never had a more unpleasant day than one where I could do nothing to help solve the problem.

+4
4 answers

There was an outage affecting SQL Azure in the European data center today. Some of our customers failed over to another data center.

If you run mission-critical applications that cannot afford to be unavailable, I would deploy the application to several regions. DNS resolution is obviously a weak point in Azure right now, but it can be worked around (if you only run a website, this can be done very simply with Response.Redirect or similar; a quick sketch follows below).
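
Here is a minimal sketch of that kind of poor-man's failover, assuming a hypothetical secondary deployment URL and a reachability probe against the primary database; all names and connection strings below are placeholders, not values from the original setup:

```
using System;
using System.Data.SqlClient;
using System.Web;

// Hypothetical HttpModule that redirects visitors to a secondary deployment
// when the primary database cannot be reached. The secondary URL and the
// connection string are placeholders.
public class FailoverRedirectModule : IHttpModule
{
    private const string SecondaryUrl = "http://myapp-backup.cloudapp.net";
    private const string PrimaryDb =
        "Server=tcp:primary.database.windows.net;Database=MyAppDb;User ID=user@primary;Password=...;";

    public void Init(HttpApplication context)
    {
        context.BeginRequest += (sender, e) =>
        {
            // In production you would cache this probe rather than run it on every request.
            if (!PrimaryIsReachable())
            {
                var app = (HttpApplication)sender;
                // Preserve the requested path so deep links still work on the mirror.
                app.Response.Redirect(SecondaryUrl + app.Request.RawUrl);
            }
        };
    }

    private static bool PrimaryIsReachable()
    {
        try
        {
            using (var conn = new SqlConnection(PrimaryDb))
            using (var cmd = new SqlCommand("SELECT 1", conn))
            {
                conn.Open();
                cmd.ExecuteScalar();
                return true;
            }
        }
        catch (SqlException)
        {
            return false;
        }
    }

    public void Dispose() { }
}
```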

Microsoft now has a Data Sync service that will keep multiple SQL Azure databases synchronized. Check here. That way you can have mirror sites in different regions and keep them in sync from the SQL Azure perspective.

I would also recommend using a third-party monitoring service that detects problems with your deployed instances from the outside. AzureWatch can alert you, or even deploy additional instances if you want, when some of your instances show up as "Unresponsive".

Hope this helps

+7

I can offer some recommendations based on our experience:

  • Host your application in multiple data centers, each complete with its own SQL Azure database. You can point each application deployment at the SQL Azure server in its own data center. You can also serve any static assets (images/JS/CSS) from the compute instances in each data center or use Azure Blob Storage. Note: additional costs will be incurred.
  • Set up one-way SQL replication between your primary SQL Azure database and an instance in another data center. If you want two-way replication, consult the guidance on the MSDN site.
  • Use Azure Traffic Manager to route traffic to the data center closest to the user. Its geo-detection capability also improves the latency of your application. That way you can map http://myapp.com to the internal URL of each data center, and a user in Europe is automatically sent to the European data center, and likewise for the USA. Note: at the time of writing there is no way to automatically detect a data center failure and fail over. Manual steps are involved once a failure is detected, and failover is all-or-nothing (that is, you fail over both your Windows Azure instances AND SQL Azure). If you want fault tolerance at a finer-grained level, I suggest putting your entire configuration in the service configuration file and encrypting the values, so that you can edit the connection string to point instance X at database Y (see the configuration sketch after this list).
  • You are almost all set now. I would build or install an on-premises application to monitor site availability. The best approach is to write a diagnostics page or web service that checks the availability of your application's components, and then poll it from a machine outside the data center (see the health-check sketch after this list).
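
A minimal sketch of the "keep the configuration in the service configuration file" idea above, assuming an illustrative setting named ActiveSqlConnectionString in ServiceConfiguration.cscfg (encrypting the value, as suggested, is omitted here for brevity):

```
using Microsoft.WindowsAzure.ServiceRuntime;

// Reads the active connection string from ServiceConfiguration.cscfg so it can
// be edited in the portal (for example, pointed at a mirror database in another
// data center) without redeploying. "ActiveSqlConnectionString" is an
// illustrative setting name, not something defined by Azure itself.
public static class DbConfig
{
    public static string ActiveConnectionString
    {
        get { return RoleEnvironment.GetConfigurationSettingValue("ActiveSqlConnectionString"); }
    }

    public static void Register()
    {
        // Accept configuration changes at runtime instead of forcing a role recycle,
        // so the edited connection string takes effect on the next database call.
        RoleEnvironment.Changing += (sender, e) => { e.Cancel = false; };
    }
}
```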
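
And a sketch of the external-polling idea: a simple diagnostics handler that reports whether the database answers, which an outside monitor (AzureWatch, a local poller, or similar) can hit. The connection-string name "AppDb" and the handler mapping are assumptions:

```
using System;
using System.Configuration;
using System.Data.SqlClient;
using System.Web;

// Simple diagnostics endpoint for external monitors to poll: returns 200/OK when
// the database answers and 503 otherwise. Map the handler to a path such as
// /health.axd in web.config.
public class HealthCheckHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/plain";
        try
        {
            var cs = ConfigurationManager.ConnectionStrings["AppDb"].ConnectionString;
            using (var conn = new SqlConnection(cs))
            using (var cmd = new SqlCommand("SELECT 1", conn))
            {
                conn.Open();
                cmd.ExecuteScalar();
            }
            context.Response.Write("OK");
        }
        catch (Exception)
        {
            // Anything that prevents the check counts as "unresponsive" to the monitor.
            context.Response.StatusCode = 503;
            context.Response.Write("DATABASE UNREACHABLE");
        }
    }
}
```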

HTH

+1

When you deploy to Azure you do not have much control over the configuration of your SQL server; MS has already configured it for availability.

Having said that, it looks like MS has been having some problems with SQL Azure over the past few days. We were told that only a small number of users were affected. At one point the service dashboard showed 5 data centers affected. I have 3 databases in one of those data centers that went down twice, for about an hour each time, but one database in another affected data center had no interruption at all.

If database connectivity is critical to your application, then the only way in the Azure environment to protect against problems MS did not prepare for (this latest technical issue, earthquakes, meteor strikes) is to keep a copy of your SQL data in another data center. At the moment the most practical way to do this is with the Sync Framework. It is possible to copy SQL Azure databases, but that only works within the same data center. With your data in another location, you can point your application at the other database if the main one becomes unavailable.

Although this looks good on paper, it might not have helped you with the most recent issue, since that affected several data centers. Simply taking copies of the database on a regular basis might be enough for you. Or not.
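
If regular copies are the route you take, SQL Azure's built-in database copy (same data center only, as noted above) can be started with a single T-SQL statement. A sketch of kicking one off from C#; server, database and credential values are placeholders:

```
using System.Data.SqlClient;

// Starts a SQL Azure database copy within the same server/data center by running
// CREATE DATABASE ... AS COPY OF ... against the master database.
class DatabaseCopy
{
    static void Main()
    {
        var master = "Server=tcp:myserver.database.windows.net;Database=master;" +
                     "User ID=admin@myserver;Password=...;Encrypt=True;";
        using (var conn = new SqlConnection(master))
        using (var cmd = new SqlCommand(
            "CREATE DATABASE MyAppDb_Copy AS COPY OF MyAppDb", conn))
        {
            conn.Open();
            // Returns quickly; the copy continues asynchronously and can be
            // monitored via the sys.dm_database_copies view.
            cmd.ExecuteNonQuery();
        }
    }
}
```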

(I would have posted this answer on Server Fault, but I could not find the question there.)

0

This is a programming/architecture question, but you may also want to ask it at webmasters.stackexchange.com.

You need to find out the root cause before drawing any conclusions.

But my hunch is that the problem was one of two things:

  • Internet connectivity differs between the test system and your production system. Either they use different ISPs, or different lines from the same ISP. When I worked at a hosting company, we made sure that our IP connectivity went through at least two different ISPs that didn't share fiber into our rooms (and, wherever we could, that they had physically different routes into the building - the homing ability of excavators for whichever piece of fiber is critical is well proven).

  • Your data center is having a problem with some shared production infrastructure. This could be border routers, firewalls, load balancers, intrusion detection systems, traffic shapers, etc. These are often deployed only in front of production systems. Protections here include understanding the architecture and making sure the provider has a (tested!) DR plan for restoring SOME service when everything goes pear-shaped. The dumbest failure I have seen here was an IPS (Intrusion Prevention System) becoming convinced that its own management servers were malicious, so it could not be reconfigured at all.

Just a thought - your DC doesn't happen to host any Wikileaks mirrors, or Paypal/Mastercard/Amazon (which are currently being DDoS'd by Wikileaks supporters)?

-1
