Nagios is good, and so is system testing (for example with Selenium) run on a regular schedule.
Edit: Hyperic and GroundWork also look interesting.
There is probably a test suite that can also handle stress testing for you. I can't remember the name off the top of my head; maybe someone can mention one below.
Other things that I like:
The best motto for infrastructure is always: correction, detection, repair. Raise the problem, get to its root cause, and cure or prevent it if you can.
Since the system exists at many levels, we must test at many levels:
Edit: have all errors and warnings sent directly to your manager, by phone or by email. That way you can track events in one place.
1) Connection: Monitor your Internet connection both from the server and from the outside. Log the results somewhere.
2) Server: Keep track of all the processes you need, to make sure they are running and not bogging down the server. Use an HP server or something similar that offers hardware failure notification running from the BIOS level. Notify and log any failures.
3) Software: Identify the key software that should always be running. Set performance thresholds, if any, and then monitor them. Nagios should be able to help with this. On Windows, it can take a little more work. When an exception occurs, you can have the monitor run a script to restart processes automatically. My dream system would let me interact with servers via SMS: the server would text me about an exception that needs my approval, or about an action that will run automatically unless I cancel it by SMS. One day..
4) Remote power: Make sure you have remote power-reset capability at hand. You might want to schedule a weekly reboot if you use Windows for anything.
5) Business logic testing: Run scripts regularly to check your system's workflow. Selenium can probably do this, but I like to log the results so I can say that the check ran at this time and there were errors in these files. Wherever possible, have the system test itself through your scripts.
6) Backups: Set up a backup that you can set and forget. If you can get things into virtual machines, that is ideal, since you can scale, move, or deploy any part of your infrastructure anywhere. I have had cases where I moved a dead server onto my laptop and let it run in VMware until I fixed the problem.
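
To make the "test at many levels and log it" idea concrete, here is a minimal Python sketch, not a replacement for Nagios. It assumes two kinds of checks, a TCP connection test (level 1) and a command that must exit cleanly (levels 2, 3 and 5), and appends one timestamped line per result to a single log file. The function names (check_tcp, check_command, run_checks) and the log format are my own, made up for illustration.

```python
import socket
import subprocess
from datetime import datetime, timezone


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_command(cmd, timeout=10.0):
    """Return True if the command (a list of arguments) exits with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or it ran too long.
        return False
    return result.returncode == 0


def run_checks(checks, log_path):
    """Run every named check, log each result, and return the failed names.

    `checks` maps a name to a zero-argument callable returning True or False.
    """
    failures = []
    with open(log_path, "a", encoding="utf-8") as log:
        for name, check in checks.items():
            ok = check()
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            log.write(f"{stamp} {name} {'OK' if ok else 'FAIL'}\n")
            if not ok:
                failures.append(name)
    return failures
```

You would call it from cron or a scheduled task with something like run_checks({"web-port": lambda: check_tcp("example.com", 443)}, "checks.log"), where the host, port and log path are placeholders for your own. The returned list of failed names is what you would hand to whatever sends your email or SMS alerts.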