How to stop berserk exception alerts

Let's say you have a .NET system that should send email notifications to the system administrator when an error occurs. Example:

    try
    {
        // do something mission critical
    }
    catch (Exception ex)
    {
        // send ex to the system administrator
        // give the customer a user-friendly explanation
    }

This block of code is called hundreds of times per second by different users.

Now let's say that an underlying API / service / database goes down. This code will fail many, many times. The poor administrator is about to wake up to several million emails in his inbox, and the developer is about to receive a rude phone call. Not that such an incident (cough) necessarily happened this morning.

It's pretty clear that this is not a design that scales well.

The first few solutions that come to mind are all flawed:

  • Log the errors to the database, then expose a high error count through an HTTP health check to an external monitoring service such as Pingdom. (My favorite candidate so far. But what if the database is the thing that goes down?)
  • Keep a static cache that tracks recent exceptions, and have the alerting system always check it for duplicates. (This seems unnecessarily complicated. Also, many error messages differ only slightly: if the message contains a timestamp, for example, duplicate detection is useless.)
  • Programmatically take our system offline after certain errors, or based on constant monitoring of critical dependencies. (Risky! What if there is a transient false positive?)
  • Simply do not report these errors, and rely on another part of the system to monitor and report on dependencies. (Does not cater for "unexpected" errors that we did not anticipate.)

This feels like a problem that must already have been solved, and like we are going about it in a stupid way. Suggestions are appreciated, even if they involve a completely different exception management strategy!

+7
design-patterns exception-handling error-handling alerts
5 answers

The simplest solution that comes to mind is to assign an identification number to this exception block (for example, 1) and record the time of the last notification sent to the administrator. If not enough time has elapsed since the last notification (say, an hour), do not inform the administrator again.

If this part of the code usually generates more than one type of exception, you can also record the exception class; if not enough time has elapsed since the last notification for the same exception class, do not inform the administrator again.
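
A minimal sketch of this idea in C#, offered as one possible shape rather than a finished implementation. The class name `AlertThrottle`, the method `ShouldNotify`, and the injectable clock are all hypothetical names introduced for illustration; the state is held in memory, so it is lost on restart and not shared between servers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: suppresses repeat alerts for the same
// (block id, exception type) pair within a minimum interval.
public sealed class AlertThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Func<DateTime> _clock; // injectable so the logic can be tested
    private readonly Dictionary<(int, Type), DateTime> _lastSent =
        new Dictionary<(int, Type), DateTime>();
    private readonly object _gate = new object();

    public AlertThrottle(TimeSpan minInterval, Func<DateTime> clock = null)
    {
        _minInterval = minInterval;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns true if the administrator should be notified now,
    // and records the notification time when it does.
    public bool ShouldNotify(int blockId, Exception ex)
    {
        var key = (blockId, ex.GetType());
        var now = _clock();
        lock (_gate)
        {
            if (_lastSent.TryGetValue(key, out var last) && now - last < _minInterval)
            {
                return false; // notified too recently for this block and type
            }
            _lastSent[key] = now;
            return true;
        }
    }
}
```

Inside the catch block, the email would then be sent only when something like `throttle.ShouldNotify(1, ex)` returns true; the user-friendly explanation is still shown every time.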

+5

Check for similarity (timestamps can be masked with wildcards, ??:?? for example) and, at first, let every exception be sent to you for a period of time. Then check which exceptions occurred most often.

Say there are 1000 exceptions of type A, 964 of type B, 120 of type C, and 7 of types D-H.

This would mean sending an email to the sysadmin for every 100th exception of types A and B, every 10th exception of type C, and any other exception as it occurs.

Pro:
+ Accurate
+ Prevents system spam
+ Not much code to implement

Con:
- It takes time to build up reliable statistics
- Important exceptions can be ignored accidentally
- Relies on humans, who are always likely to fail
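
The sampling scheme above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: `SampledAlerter`, `Signature`, and the timestamp pattern are hypothetical, and the per-signature rates are assumed to have been chosen by hand after the observation period. The sketch reports the first occurrence and then every Nth one after it, so a brand-new failure is not silent until the Nth repeat.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: sends only every Nth alert for frequent exception signatures.
public sealed class SampledAlerter
{
    // Assumption: timestamps look like "12:34" or "12:34:56".
    // Adjust the pattern to match your own messages.
    private static readonly Regex TimePattern = new Regex(@"\d{1,2}:\d{2}(:\d{2})?");

    private readonly Dictionary<string, int> _sampleRates; // signature -> N
    private readonly Dictionary<string, long> _counts = new Dictionary<string, long>();
    private readonly object _gate = new object();

    public SampledAlerter(Dictionary<string, int> sampleRates)
    {
        _sampleRates = sampleRates;
    }

    // Masks timestamps so that near-identical messages share one signature.
    public static string Signature(Exception ex) =>
        ex.GetType().Name + ": " + TimePattern.Replace(ex.Message, "??:??");

    public bool ShouldNotify(Exception ex)
    {
        var signature = Signature(ex);
        lock (_gate)
        {
            _counts.TryGetValue(signature, out var count); // 0 when not seen before
            _counts[signature] = ++count;

            // Signatures without a configured rate default to 1: always send.
            var rate = _sampleRates.TryGetValue(signature, out var n) ? n : 1;

            // Send the 1st occurrence, then the (N+1)th, (2N+1)th, and so on.
            return (count - 1) % rate == 0;
        }
    }
}
```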

+1

I have built monitoring applications that emailed administrators before, and I sheepishly admit that I have been in your situation. The solution is to throttle your emails. Save the time the last email was sent, and before sending another, check whether a minimum amount of time (say, 10 minutes or longer) has passed since then. That way, the maximum number of emails your poor admin will receive is <time issue has been going on> / <period>. In my previous sysadmin job, this balanced our need to know that the problem was still ongoing against the need to keep the mailbox from being flooded with 1000 emails per hour.

0

We have something similar in one of our remote applications. It sends all exceptions to an intermediary mailbox, and a script runs every hour that scans the mail and creates a summary email that is sent to our mailbox (at most 24 emails per day). It also saves the rest of the data to a local database for future reference.

It's not bulletproof, but it's pretty quick and easy to set up.
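
For comparison, the same digest idea can be sketched inside the application itself. Note that this swaps the intermediary mailbox and the hourly script for an in-memory buffer, so it is not the setup described above; `ExceptionDigest`, `Record`, and `Flush` are hypothetical names, and the buffer is lost if the process dies before the next flush.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: collects exceptions and produces one summary per flush.
public sealed class ExceptionDigest
{
    private readonly List<Exception> _buffer = new List<Exception>();
    private readonly object _gate = new object();

    // Call from the catch block instead of sending an email.
    public void Record(Exception ex)
    {
        lock (_gate) _buffer.Add(ex);
    }

    // Call from a timer or scheduled job (hourly, for example).
    // Returns the summary text, or null if nothing was recorded.
    public string Flush()
    {
        List<Exception> batch;
        lock (_gate)
        {
            if (_buffer.Count == 0) return null;
            batch = new List<Exception>(_buffer);
            _buffer.Clear();
        }

        var summary = new StringBuilder();
        summary.AppendLine($"{batch.Count} exception(s) since the last digest:");

        // Most frequent exception types first, with one sample message each.
        foreach (var group in batch.GroupBy(e => e.GetType().Name)
                                   .OrderByDescending(g => g.Count()))
        {
            summary.AppendLine(
                $"  {group.Count()} x {group.Key} (e.g. \"{group.First().Message}\")");
        }
        return summary.ToString();
    }
}
```

With an hourly flush, this keeps the same ceiling of 24 emails per day mentioned above.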

0

I know that this has already been answered, but I feel this is useful to post.

Microsoft has published a wealth of information about cloud design patterns and architecture, ranging from things like microservices and service buses with message queues to smaller details. All of this is on the Microsoft Docs website, under Azure Architecture. The specific pattern that deals with this kind of problem is the Circuit Breaker pattern.

Using this pattern does not completely solve the problem; there is still the question of "how do we throttle the reports sent to the operations people?" One possible solution is, whenever the circuit breaker trips, to increment an internal counter to create a unique trip identifier (or something similar). Subsequent notifications can then use this identifier. This is just an example; there may be other ways you could reasonably do this. The point is to use a circuit breaker to handle the decision logic, place it wherever you need its services, and hook something into it to provide the notification behavior you describe. At the very least, you can avoid sending a flood of emails.
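
A deliberately simplified sketch of a circuit breaker that notifies once per trip, using the trip-identifier idea from this answer. It is not the Microsoft reference implementation: `NotifyingCircuitBreaker`, `TryExecute`, and the `onTrip` callback are hypothetical names, and the half-open state is reduced to "let calls through again after the open period; one more failure re-trips".

```csharp
using System;

// Hypothetical, simplified circuit breaker that raises one notification per trip.
public sealed class NotifyingCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Action<int, Exception> _onTrip; // (tripId, last exception)
    private readonly Func<DateTime> _clock;
    private readonly object _gate = new object();
    private int _consecutiveFailures;
    private DateTime? _openedAt;
    private int _tripId;

    public NotifyingCircuitBreaker(int failureThreshold, TimeSpan openDuration,
        Action<int, Exception> onTrip, Func<DateTime> clock = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _onTrip = onTrip;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns true if the operation ran and succeeded; false if it failed
    // or was skipped because the breaker is open.
    public bool TryExecute(Action operation)
    {
        lock (_gate)
        {
            if (_openedAt.HasValue)
            {
                if (_clock() - _openedAt.Value < _openDuration) return false; // open: fail fast
                _openedAt = null;                             // open period over: try again
                _consecutiveFailures = _failureThreshold - 1; // one more failure re-trips
            }
        }

        try
        {
            operation();
            lock (_gate) _consecutiveFailures = 0;
            return true;
        }
        catch (Exception ex)
        {
            int? tripped = null;
            lock (_gate)
            {
                if (!_openedAt.HasValue && ++_consecutiveFailures >= _failureThreshold)
                {
                    _openedAt = _clock();
                    tripped = ++_tripId; // unique identifier for this trip
                }
            }
            if (tripped.HasValue) _onTrip(tripped.Value, ex); // at most one alert per trip
            return false;
        }
    }
}
```

During a long outage this sends at most one notification per open period, each carrying a new trip identifier, while every caller can still show the customer a friendly message whenever `TryExecute` returns false.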

0