Our team runs a number of processes that we start manually but that can run for many days. Each process does its work over a large number of objects (web pages, database rows, images, files, etc.). Naturally, there are occasional crashes, and we have to design our processes to handle them gracefully and move on, so that one failure does not interrupt all the remaining work.
Depending on the particular process, the frequency, severity, and urgency of failures vary. In some cases we send an email when a rare but important error occurs; in other cases we just log it and move on; and so on.
The problem is that our error-handling code is scattered all over the place, and more often than not, when we "log it and move on", nobody ever goes back and reads the logs, so nobody knows what problems occurred. We can't email every problem by default either, because there would simply be too many emails.
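To make the "email the rare stuff, log the rest" split concrete, here is a minimal sketch using Python's standard `logging` module, since our processes are in Python. The logger name, file path, mail host, and addresses are all placeholders, not anything from our actual setup:

```python
import logging
from logging.handlers import SMTPHandler

# Hypothetical shared logger that every long-running process would use.
logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)

# Everything goes to a file for later inspection.
file_handler = logging.FileHandler("pipeline.log")
file_handler.setLevel(logging.INFO)
logger.addHandler(file_handler)

# Only CRITICAL records trigger an email, so the inbox stays quiet.
# Host and addresses below are placeholders.
mail_handler = SMTPHandler(
    mailhost="smtp.example.com",
    fromaddr="pipeline@example.com",
    toaddrs=["oncall@example.com"],
    subject="Pipeline failure",
)
mail_handler.setLevel(logging.CRITICAL)
logger.addHandler(mail_handler)

def process_item(item):
    try:
        pass  # real per-object work would go here
    except Exception:
        # Logged with traceback, not emailed; the process moves on.
        logger.exception("failed on %r", item)
```

This centralizes the policy in one place, but it still has the original weakness: the file-only records are invisible unless someone reads the log.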
These are long-running processes, but not daemons, for which something like SNMP or Nagios would seem a natural fit. This must be a fairly common problem, yet I can't find many solutions online. I've heard people talk about using log4j (or other similar logging packages) to write log events to a database, which seems like a step in the right direction, but surely there are more complete solutions by now? I imagine something where your logger writes events to a database, and a Nagios-like web interface lets you watch in real time what errors each process produces, set up email alerts for specific patterns, and so on.
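In case it helps clarify what I mean by "logger writes events to a database", here is a rough sketch of the idea in Python. The `SQLiteHandler` class, table layout, and file names are all hypothetical, just to illustrate the shape of the thing a dashboard or alerting job could query:

```python
import logging
import sqlite3

class SQLiteHandler(logging.Handler):
    """Hypothetical handler: writes each log record to a table that a
    web dashboard or alert job could poll for errors and patterns."""

    def __init__(self, path):
        super().__init__()
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS log "
            "(created REAL, level TEXT, process TEXT, message TEXT)"
        )

    def emit(self, record):
        self.conn.execute(
            "INSERT INTO log VALUES (?, ?, ?, ?)",
            (record.created, record.levelname,
             record.processName, record.getMessage()),
        )
        self.conn.commit()

logger = logging.getLogger("batch")
logger.addHandler(SQLiteHandler("events.db"))
logger.error("could not fetch page %s", "http://example.com")

# A dashboard or alerting cron job could then query for patterns:
conn = sqlite3.connect("events.db")
rows = conn.execute(
    "SELECT level, message FROM log WHERE level = 'ERROR'"
).fetchall()
```

The missing piece, and the point of my question, is the ready-made web interface and alert-rule layer on top of a store like this.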
Does something like this exist? If not, what approaches have you used to solve this kind of problem successfully?
(Our processes are written in Python.)
Update: Chainsaw has come up, but a webapp along the lines described above is closer to what I'm after.
Update: Hoptoad (hoptoadapp) and Exceptional look interesting, but they are built for Rails.