Best practices for monitoring web applications

We are finishing our web application and planning to deploy it. A very important aspect of running in production is monitoring the health of the system. Since we have a small development/support team, it is very important for us to get early notification of potential problems so we can resolve them before they affect users.

Nagios looks like a good option, but I would like to get more opinions on the best tools and practices for monitoring web applications in general, and a Django application in particular. I would also welcome recommendations on what to keep track of besides the obvious CPU, memory, disk space, and database connectivity.

Our web application is written in Django; we run on Linux (Ubuntu) under Apache + FastCGI with a PostgreSQL database.

EDIT: We host on Linode, a fully virtualized environment.

EDIT: We use django-logging, so we already have a way to track info, error, and critical messages.

+74
django web-applications deployment monitoring
Jan 30 '09 at 15:52
18 answers

Nagios is good; also consider having system tests (Selenium) running regularly.

Edit: Hyperic and GroundWork also look interesting.

There is probably a test-suite system that can also handle load testing for you. I can't remember the name off the top of my head; maybe someone can mention one below.

Other things that I like:

The best motto for infrastructure is always: fix, detect, prevent. Get it back up, find the root cause, and cure or prevent it if you can.

Since the system exists at many levels, we should test at many levels:

Edit: Have all errors and warnings posted directly to your case manager by email. That way you can track occurrences in one place.

1) Connectivity: monitor your Internet connectivity from the server and from the outside. Log the results somewhere.

2) Server: monitor all the processes you depend on, to make sure they are running and not pegging the server. Use an HP server or similar with hardware-failure notification that can run at the BIOS level. Notify and log if anything happens.

3) Software: identify the key software that must always be running. Set performance levels, if relevant, and then monitor them. Nagios should be able to help with this; on Windows it can be a bit harder. When an exception occurs, you can run a script from it to restart processes automatically (a minimal watchdog sketch appears after this list). My dream system lets me interact with servers via SMS: if the server sees an exception, either I authorize the fix by SMS, or it runs automatically unless I cancel by SMS. One day...

4) Remote power: make sure remote power-reset capabilities are in your hands. You might want to schedule a weekly reboot if you ever use Windows for anything.

5) Business-logic testing: run scripts regularly that exercise your system's work-flows. Selenium can probably do this, but I like to log the results too, so I can say "this ran at this time and these errors showed up in these files". Wherever possible, have the system test itself through your scripts.

6) Backups: create a backup that you can set and forget. If you can get things into virtual machines, that is ideal, as you can then scale, move, or deploy any part of your infrastructure anywhere. I have had cases where I moved a dead server onto my laptop and let it run in VMware until I fixed the problem.
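
For point 3, a cron job can cover the basics before you bring in Nagios or Monit. A minimal sketch, assuming a Linux box with pgrep available; the process names and restart commands here are placeholders, not taken from the question:

    #!/usr/bin/env python3
    """Minimal process watchdog, meant to be run from cron every minute."""
    import subprocess
    import syslog

    # Placeholder process names and restart commands; adapt to your stack.
    WATCHED = {
        "apache2": ["service", "apache2", "restart"],
        "postgres": ["service", "postgresql", "restart"],
    }

    def is_running(name):
        # pgrep exits with 0 when at least one process matches exactly.
        return subprocess.call(["pgrep", "-x", name],
                               stdout=subprocess.DEVNULL) == 0

    for name, restart_cmd in WATCHED.items():
        if not is_running(name):
            syslog.syslog(syslog.LOG_CRIT, "%s is down, restarting" % name)
            subprocess.call(restart_cmd)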

+35
Jan 30 '09 at 16:39

Monitoring the number of connections to your web server and your database is another good thing to track. Chances are that if one of those shoots through the roof, something is starving for resources and the site is about to go down.
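
With PostgreSQL, the connection count is one query away: pg_stat_activity holds one row per open connection. A minimal sketch of a check script, assuming psycopg2; the DSN and threshold are placeholders:

    import psycopg2

    WARN_AT = 80  # placeholder: e.g. 80% of a 100-connection limit

    conn = psycopg2.connect("dbname=myapp user=monitor")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    open_connections = cur.fetchone()[0]
    if open_connections >= WARN_AT:
        print("WARNING: %d open PostgreSQL connections" % open_connections)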

Also make sure you have a regular request for a URL that is a reasonable end-to-end test of the system. If your site supports search, have Nagios run a search; that should verify the search index, the web server, and the database server in one go.
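
A Nagios plugin is just a program that prints a status line and exits 0 (OK), 1 (warning), or 2 (critical), so the end-to-end search check can be a few lines of Python. A sketch, with the URL and expected marker text as placeholders:

    #!/usr/bin/env python3
    """Minimal Nagios-style check: fetch a search URL, verify content."""
    import sys
    import urllib.request

    URL = "https://www.example.com/search?q=widget"   # placeholder
    MARKER = "results for"  # text a healthy search page should contain

    try:
        body = urllib.request.urlopen(URL, timeout=10).read().decode("utf-8")
    except Exception as exc:
        print("CRITICAL: %s" % exc)
        sys.exit(2)   # Nagios: 2 = critical

    if MARKER not in body:
        print("WARNING: page loaded but no search results found")
        sys.exit(1)   # Nagios: 1 = warning

    print("OK: search is returning results")
    sys.exit(0)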

Also, make sure your application sends you an email anytime your users see an error or an unhandled exception occurs. That way you know how the application is failing in the field.
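
Django supports this out of the box: with DEBUG = False, every unhandled exception is emailed to the addresses listed in ADMINS. A minimal settings.py excerpt; the addresses and SMTP host are placeholders:

    # settings.py -- addresses and SMTP host are placeholders
    DEBUG = False

    ADMINS = (
        ("Ops Team", "ops@example.com"),
    )
    MANAGERS = ADMINS  # receives broken-link (404) reports if enabled

    SERVER_EMAIL = "django@example.com"  # From: address for error mails
    EMAIL_HOST = "smtp.example.com"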

+12
Jan 30 '09 at 16:55

If I had to pick one type of testing, it would be testing the end-user functionality of the system. The important thing is to consider the user. While testing things like database availability, server uptime, and so on is all important, testing work-flows through your system via a remote UI-testing system covers all of those bases at once. If you know that the critical parts of your system are available to the end user, then you know your system is probably fine.

  • Identify the important work-flows in your system. For example, if you wrote an e-commerce site, you might identify the work-flow of "find a product, put the product in the shopping cart, and purchase the product".
  • Prioritize the work-flows and build tests for the higher-priority ones first. You can always add more tests after you deploy to production.
  • Build UI tests using one of the available UI-testing frameworks. There are a number of free and commercial frameworks whose tests can be run automatically. Build a core set of tests first that address the critical work-flows (see the sketch after this list).
  • Set up at least one remote location from which to run the tests. You want to test every aspect of your system, which means testing it remotely. Is the Internet connection up? Is the web server running? Is the connection to the database server working? And so on. If you test remotely, you make sure the system is available to the outside world, which means it is most likely working end to end. You can also run these tests internally, but I think it is vital to run them externally.
  • Make sure your solution includes both reporting and notification. If one of your critical work-flow tests fails, you want someone to know about it so the problem can be fixed as soon as possible. If a non-critical task fails, reporting alone may be enough, so you can fix the problem out of band.
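
As a sketch of what such a critical-work-flow test can look like with Selenium's Python bindings (all URLs, element locators, and link texts here are placeholders):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        # Work-flow: find a product, add it to the cart, reach checkout.
        driver.get("https://www.example.com/")
        driver.find_element(By.NAME, "q").send_keys("widget\n")
        driver.find_element(By.LINK_TEXT, "Blue Widget").click()
        driver.find_element(By.ID, "add-to-cart").click()
        driver.get("https://www.example.com/checkout/")
        assert "Checkout" in driver.title
        print("OK: purchase work-flow reachable")
    finally:
        driver.quit()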

This end-user testing should not eliminate system monitoring in your data center, but I want to reiterate that end-user testing is the most important kind of testing you can do for a web application.

+11
Feb 01 '09 at 20:44

Ahhh, monitoring. How I love you and your vibrations at 3 a.m.

Essentially, you need a way to inspect the internal state of your application, both at a specific moment and over intervals (the latter is very important for catching problems before they surface). Another way to think of it is as glorified unit testing.

We have our own (very nice) monitoring system, so I can't comment on Nagios or other applications, but our use case is similar to yours (a CGI app on Apache).

  • Add a method like logging.monitor() that logs information to disk. It should support, at the least, logging simple numbers and dicts of numbers (the key => value association can be incredibly handy). A minimal sketch of such a helper follows this list.
  • Have a process that scrapes the monitoring logs and stores them in a database.
  • Have a process that takes the database information, checks it against rules, and sends out alerts. Keep in mind that some things can be flaky: just because you got one 404 does not mean the app is down.
  • Have a way to mute alerts (very useful during maintenance windows).
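
A minimal sketch of the logging.monitor()-style helper from the first bullet; the file path and the JSON-lines format are my own illustrative choices:

    import json
    import time

    MONITOR_LOG = "/var/log/myapp/monitor.log"  # placeholder path

    def monitor(metric, value):
        """value can be a number or a dict of numbers (key => value)."""
        record = {"ts": time.time(), "metric": metric, "value": value}
        with open(MONITOR_LOG, "a") as fh:
            fh.write(json.dumps(record) + "\n")

    # Usage from application code:
    # monitor("requests_per_sec", 42)
    # monitor("response_codes", {"200": 970, "404": 25, "500": 5})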

This is all pretty high-level. The important thing is that over time you build a history of the state of the application. From that you can create rules (perhaps just raw SQL queries that you put in a configuration somewhere) that say "if the queries per second double, send a SlashDotted alert" or "if 50% of the responses are 404s, send an alert"; an example follows below. It also wows management, because you can quantify any comment about whether the system is up, down, fast, or slow.
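
For instance, the rules could be raw SQL queries plus thresholds kept in configuration; the table and column names below are hypothetical:

    RULES = [
        {
            "name": "slashdotted",
            "sql": """SELECT count(*) FROM requests
                      WHERE ts > now() - interval '1 minute'""",
            "threshold": 1000,  # alert above 1000 requests/minute
        },
        {
            "name": "many-404s",
            "sql": """SELECT 100.0 * sum(CASE WHEN status = 404
                                         THEN 1 ELSE 0 END)
                             / nullif(count(*), 0)
                      FROM requests
                      WHERE ts > now() - interval '5 minutes'""",
            "threshold": 50,  # alert if over 50% of responses are 404s
        },
    ]

    def check_rules(cursor, alert):
        for rule in RULES:
            cursor.execute(rule["sql"])
            value = cursor.fetchone()[0] or 0
            if value > rule["threshold"]:
                alert("%s fired at %s" % (rule["name"], value))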

Things worth monitoring include (others have probably mentioned these as well): HTTP status, port accessibility, HTTP load, database load, open connections, query latency, server accessibility (ssh, ping), queries per second, number of worker processes, error percentage, and error rate.

Simple end-to-end tests are also very handy, though they can be brittle. It's best to keep them simple, but you should have one that tries to touch the core parts of the application (caching, database, authentication).
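
In a Django app, one way to get such a test is a health-check view that a monitor can hit over HTTP. This sketch (the view name and details are my own, not from the answer) touches the cache, the database, and the auth tables in one request:

    from django.contrib.auth.models import User
    from django.core.cache import cache
    from django.http import HttpResponse, HttpResponseServerError

    def health(request):
        try:
            cache.set("health-ping", "pong", 10)      # caching layer
            assert cache.get("health-ping") == "pong"
            User.objects.exists()                     # database + auth schema
        except Exception as exc:
            return HttpResponseServerError("FAIL: %s" % exc)
        return HttpResponse("OK")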

+7
Feb 01 '09 at 21:17

I use Munin and Monit, and have been very pleased with both of them.

+5
Jan 31 '09 at 2:58

Internal logging is fine and dandy, but when your whole app goes down or your box/environment crashes, you need an outside check too. http://www.pingdom.com/ has been very reliable for me.

My only other advice: I wouldn't spend too much time on this. My favorite example is Twitter: think how much energy they put into having a system capable of being half-up, instead of investing that time and energy into throwing more hardware at it and scaling it out.

Chances are that whatever takes you down is something your logs and health systems will have missed anyway.

+4
Feb 04 '09 at 18:59

The single most important way to monitor any website is to monitor it from the outside. The goal should be to monitor your site in a way that most closely reflects how your users use it. In 99% of cases, as soon as you know that your site is down externally, it is relatively easy to find the root cause. The most important thing is to find out as soon as possible that your customers cannot load your site.

This usually means using an external performance-monitoring service. They range from the very low end (mon.itor.us, pingdom) to the high end (Webmetrics, Gomez, Keynote), and as always, you get what you pay for. When shopping around for a monitoring service, look for:

  • Size and distribution of the monitoring network
  • Whether the service can monitor your site using a real browser (otherwise you are not testing your site the way a real user experiences it)
  • A scripting language (to script transactions against your site)
  • A support department that will help you along the way and provide expert advice on how to monitor properly

Good luck

+4
Feb 08 '09 at 0:38

The IP Patrol and SiteSentry web-monitoring services have been helpful to us. The second is a bit like Site Confidence, but a little nicer.

+3
Aug 14 '11 at 13:16

Have you thought about monitoring functionality? A script (either in a scripting language like Perl or Python, or using a tool like WebTest) that talks to your application and performs important steps like logging in, making a purchase, and so on, is very nice; for example, something like the sketch below.
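
In Python, such a script can be a few lines with the requests library; the URL, form fields, credentials, and expected text are placeholders (a real Django login form would also need the CSRF token submitted):

    import sys
    import requests

    session = requests.Session()
    resp = session.post(
        "https://www.example.com/accounts/login/",
        data={"username": "monitor", "password": "secret"},  # placeholders
        timeout=10,
    )
    if resp.status_code != 200 or "Welcome" not in resp.text:
        print("CRITICAL: login work-flow broken")
        sys.exit(2)
    print("OK: login work-flow healthy")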

+2
Jan 30 '09 at 16:39

In addition to the monitoring suggestions already given, make sure that whatever system you use sends you only one notification for an error that occurs many times in a row, rather than one per request, or your mailbox will run out of space :) Besides, it's just annoying...
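
A minimal de-duplication sketch, assuming an in-process store; a real system would persist the timestamps (e.g. in the cache or database):

    import time

    SUPPRESS_SECONDS = 3600  # at most one mail per error signature per hour
    _last_sent = {}

    def notify_once(signature, send_email):
        """signature: e.g. (exception class, view name); send_email: callable."""
        now = time.time()
        if now - _last_sent.get(signature, 0) > SUPPRESS_SECONDS:
            _last_sent[signature] = now
            send_email()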

Split the on-call rotation among the support/developer team, so that one person is not on call every single night. That wears people down. Monitoring is a good thing, but everyone should get the chance to have a life once in a while. Your cell phone buzzing at 2 a.m. for a few nights in a row gets very old very quickly, trust me. And not every developer is used to round-the-clock support, so you need to find the balance between using monitoring and abusing monitoring.

Basically, have separate escalation levels, and unless the sky is falling, define a "serenity now" nighttime window during which the lower escalation levels do not page anyone.

+2
Feb 01 '09 at 20:31

I use Nagios + CruiseControl + Selenium for running high-level tests on mission-critical web applications. I once got burned badly by a simple jQuery error that prevented users from submitting the online registration form.

http://www.agileatwork.com/the-holy-trinity-of-web-2-0-application-monitoring/

+2
Jul 25 '09 at 21:06

You could look at AlertGrid. This web application lets you filter and forward alerts to your team (worldwide). It also has a nice ability to alert you when something did not happen.

+2
04 Oct '10 at 21:08

To paraphrase Richard Levasseur: ah, monitoring tools, how your imperfections frustrate me. There does not seem to be a perfect tool out there. Nagios is fairly easy to set up, but its UI is somewhat dated, and you have to run a daemon on every monitored server. Zenoss has a much nicer interface, including trend graphs of resource usage, but it uses SNMP, so you need some familiarity with that to get it working properly, and the documentation is not the best: there are hundreds of pages, but it is very hard to find just the information you need to get started.

My friends have also recommended Cacti and Hyperic, but I have no personal experience with them.

One last thing: one of the other answers suggested running a tool that stresses your site. I would not recommend doing that against your live site unless you have a reliable quiet period when nobody is hitting it; even then, you might bring it down unexpectedly. It is much better to have a staging server where you can run load tests before pushing changes to production.

+1
Feb 04 '09 at 18:54

Also consider having system tests (Selenium) running regularly.

=> 100% ACK. We use http://www.alertfox.com for this. With our PRO2 account they run a full regression test every hour, which is great. You can do this even with a free account, but you are limited to a single transaction sensor.

+1
Aug 20 '09 at 10:05

One of our customers uses Techout (www.techout.com) and is very pleased with the service.

There are no limits on alerts, regardless of type or number, and they offer alerts by email, voicemail, and SMS; if anything goes down, a live person calls to help you.

It is all service-based: you do not install any software, and a consultant works with you to determine the best approach for your business. It is one of the most convenient web-monitoring offerings because they take care of everything.

0
Apr 7 '09 at 16:47

I would just add that you can predict the likelihood of errors from the history of past errors and their fixes. If you chart the declining frequency and severity of the problems fixed so far, you get a view of how many new problems to expect. If everything has been running error-free for a while, then recent changes and scalability issues become the two most likely sources of problems.

From the above, it sounds like scalability is your main concern, but I mention the frequency of past errors because teams I have been on invariably believed they had just fixed the last bug and there would be no more. Until there were.

0
Jun 09 '09 at 17:10

On a slightly different note, one thing I really found useful, and that changed a lot about how I monitor my applications, is logging JavaScript exceptions somewhere. There is a very nice implementation that logs them directly from users' browsers to Google Analytics. This is a must for JavaScript-heavy web applications, and it gives you results straight from users' browsers, which can surface very unexpected errors (IE and mobile browsers are a pain).

Disclaimer: the post below is mine.

http://www.directperformance.com.br/en/javascript-debug-simples-com-google-analytics

0
Apr 14 '11 at 12:01

For monitoring your online presence, I suggest a service I am working on: Sucuri NBIM (Network-based Integrity Monitoring).

It checks availability and integrity, watching for changes to your online presence (sites, DNS, WHOIS, headers, etc.) and for loss of connectivity. It is free, and you can try it out here.

-1
May 13, '09 at 15:10


