How do you reproduce errors that occur sporadically?

Question

We have an error in our application that does not occur every time, so we do not know its “logic”. I could not reproduce it even once in 100 attempts today.

Disclaimer: This error exists, and I have seen it. This is not a PEBKAC or anything like that.

What are common tips for reproducing this kind of error?

+61

language-agnostic debugging logging

guerda Mar 25 '10 at 13:33

28 answers

Add some kind of logging or tracing. For example, log the last X actions the user performed before the error was raised (this only works if you can define a condition that detects the error).

+24

Bryan Denny Mar 25 '10 at 13:36

There is no good general answer to the question, but here is what I have found:

1) This requires talent. Not every developer is well suited to it, even those who are superstars in other areas. So know your team and who has a talent for it, and hopefully you can give them enough candy to get them enthusiastic about helping you, even if it is not their area.
2) Work backwards and treat it like a scientific investigation. Start with the error, the thing you see that is wrong. Develop hypotheses about what could cause it (this is the creative part, an art for which not everyone has talent); it helps a lot to understand how the code works. For each hypothesis (preferably ordered by what you think is most likely; again, pure gut feeling here), develop a test that tries to eliminate it as the cause, and run it. Any single failure to match the prediction does not mean the hypothesis is false. Keep testing a hypothesis until it is confirmed to be wrong (as it becomes less likely you may move on to another hypothesis first, but do not discard it until you have a definitive rejection).
3) Collect as much data as possible during this process. Extensive logging and whatever else is applicable. Do not discard a hypothesis because you lack data; rather, remedy the lack of data. Quite often the inspiration for the right hypothesis comes from examining the data: noticing something in a stack trace, a strange problem in a log, something missing from the database that should be there, etc.
4) Double-check every assumption. So many times I have seen a problem not get fixed quickly because some general method call was not investigated further, so it was simply assumed not to be involved. "Oh that, that should be simple." (See point 1).

If you run out of hypotheses, this is usually due to insufficient knowledge of the system (this is true even if you wrote every line of code yourself), and you need to go through the code and review it to gain the extra insight needed to come up with a new idea.

Of course, none of the above guarantees anything, but this is the approach I have found that gets results consistently.

+20

Yishai Mar 25

Very often, programmers cannot reproduce an error that users experience simply because they have developed a particular workflow and habits in using the application that neatly circumvent the error.

At this 1/100 frequency, I would say that the first thing to do is to handle exceptions and log anything, anywhere, or you could spend another week hunting this error. Also make a priority list of potentially sensitive junctions and functions in your project. For example: 1 - Multithreading 2 - Wild pointers / loose arrays 3 - Reliance on input devices, etc. This will help you segment the areas that you can brute-force until they break, as suggested by other posters.

+8

Jelly Amma Mar 25

Since this is language-agnostic, I'll mention a few axioms of debugging.

Nothing a computer ever does is random. A "random occurrence" indicates a pattern that has not been discovered yet. Debugging begins with isolating the pattern. Vary individual elements and evaluate what changes the behavior of the error.

Different user, same computer? Same user, different computer? Is the occurrence strongly periodic? Does rebooting change the periodicity?

FYI: I once saw a bug that was experienced by only one person. I literally mean a person, not a user account. User A would never see the problem on their system; User B would sit down at that workstation, signed in as User A, and could immediately reproduce the bug. There should be no conceivable way for an application to know which physical body is in the chair. However -

The users used the application in different ways. User A habitually used a hotkey to invoke an action, and User B used an on-screen control. The difference in user behavior cascaded into a visible error a few actions later.

Any difference that affects the behavior of the error should be investigated, even if it makes no sense.

+7

JSacksteder Mar 25 '10 at 16:35

It is quite likely that your application is MTWIDNTBMT (Multi-Threaded When It Doesn't Need To Be Multi-Threaded), or maybe just multi-threaded (to be polite). A good way to reproduce sporadic errors in multi-threaded applications is to sprinkle code like this around (C#):

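    Random rnd = new Random();
    System.Threading.Thread.Sleep(rnd.Next(2000)); // sleep for a random 0-2 seconds

and/or this:

    // long counter: an int can never reach 4,000,000,000, so the loop would not end
    for (long i = 0; i < 4000000000; i++)
    {
        // tight loop
    }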

to simulate threads completing their tasks at different times than usual, or tying up the CPU for long stretches.

Over the years, I have inherited a lot of buggy, multi-threaded applications, and code like the above examples usually makes the sporadic errors occur much more frequently.
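
For readers outside .NET, here is a rough Python sketch of the same trick, not the answerer's code. The names `jitter` and `run_counter` are made up for illustration. A random sleep between reading and writing a shared counter widens the race window, so lost updates that would normally be rare tend to show up on almost every run. Passing a lock shows the corrected behavior.

```python
import random
import threading
import time


def jitter(max_seconds=0.002):
    """Sleep for a random interval to perturb thread scheduling."""
    time.sleep(random.uniform(0, max_seconds))


def run_counter(workers=8, increments=25, lock=None):
    """Increment a shared counter from several threads; return the final value."""
    state = {"count": 0}

    def work():
        for _ in range(increments):
            if lock is not None:
                # Correct version: read-modify-write happens under the lock
                with lock:
                    current = state["count"]
                    jitter()
                    state["count"] = current + 1
            else:
                current = state["count"]      # read
                jitter()                      # widen the race window
                state["count"] = current + 1  # write back a possibly stale value

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["count"]
```

Without a lock, `run_counter()` will usually return far less than `workers * increments`; with `lock=threading.Lock()` it returns exactly that product.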

+5

MusiGenesis Mar 25 '10 at 14:44

Add verbose logging. It will take several - sometimes a dozen or more - iterations of adding logging to understand the scenario. Now, the catch is that if the problem is a race condition, which is likely when it does not reproduce reliably, logging can change the timing and the problem will stop happening. In that case, do not log to a file; keep a rotating log buffer in memory and dump it to disk only when you detect that the problem has occurred.
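
A minimal Python sketch of such an in-memory rotating buffer follows. The class name `RingLog` and its methods are hypothetical, not from any library. It assumes the application has some way to detect the failure and call `dump` at that moment.

```python
import collections
import time


class RingLog:
    """In-memory log that keeps only the newest `capacity` entries."""

    def __init__(self, capacity=1000):
        # deque(maxlen=...) silently discards the oldest entry when full
        self._entries = collections.deque(maxlen=capacity)

    def log(self, message):
        """Record a message with a timestamp; no disk I/O happens here."""
        self._entries.append((time.time(), message))

    def messages(self):
        """Return the retained messages, oldest first."""
        return [message for _, message in self._entries]

    def dump(self, path):
        """Write the retained entries to `path`; call once the bug is detected."""
        with open(path, "w", encoding="utf-8") as handle:
            for timestamp, message in self._entries:
                handle.write(f"{timestamp:.6f} {message}\n")
        return len(self._entries)
```

Appending to a bounded deque is cheap compared with file I/O, so it should disturb timing less than writing every line to disk, though it can still shift a tight race.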

Edit: a few more thoughts: if this is a GUI application, run the tests with a QA automation tool that lets you replay macros. If it is a service-type application, try to come up with at least a guess as to what is going on, and then programmatically create "freak" usage patterns that exercise the code you suspect. Create higher-than-usual loads, etc.

+4

MK. Mar 25 '10 at 13:39

What is the development environment? For C++, VMware Workstation record/replay might be your best bet, see http://stackframe.blogspot.com/2007/04/workstation-60-and-death-of.html

Other suggestions include examining the stack trace and a careful code review ... there really is no silver bullet :)

+3

Virgil Mar 25 '10 at 13:37

Try adding code to the application that automatically traces the error once it occurs (or even notifies you by mail or SMS).

Log whatever you can, so that when it happens you capture the right system state.
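
As an illustration only, a Python decorator along these lines could capture the arguments and traceback at the moment of failure. `capture_on_error`, `reports` and `parse_quantity` are hypothetical names. A real setup would probably send the report to a log file or a notification service instead of a list.

```python
import functools
import traceback


def capture_on_error(sink):
    """Decorator factory: on an exception, append a report to `sink` and re-raise."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                sink.append({
                    "function": func.__name__,
                    "args": repr(args),
                    "kwargs": repr(kwargs),
                    "error": repr(exc),
                    "traceback": traceback.format_exc(),
                })
                raise  # do not hide the error from the caller

        return wrapper

    return decorate


reports = []


@capture_on_error(reports)
def parse_quantity(text):
    # Hypothetical function standing in for the code under suspicion
    return int(text)
```

Calling `parse_quantity("seven")` still raises `ValueError`, but `reports` now holds the inputs and the stack trace that led to it.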

Another thing is to try automated testing, which can cover more territory than human-based testing. It is a long shot, but good practice in general.

+2

Dani Mar 25 '10 at 13:36

All of the above, plus throw a brute-force software robot at it that is semi-random, and scatter a lot of assert/verify checks (C/C++, probably similar in other languages) through the code.
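
One possible reading of this suggestion, sketched in Python: a driver feeds seeded random actions into the system and checks an assertion after each step. `Inventory` is a toy stand-in with a deliberate bug, and `robot` and `find_failing_seed` are hypothetical names. Because the random generator is seeded, any failing run can be replayed exactly.

```python
import random


class Inventory:
    """Toy system under test with a deliberate bug (hypothetical)."""

    def __init__(self):
        self.stock = 0

    def add(self):
        self.stock += 1

    def remove(self):
        # Bug: no check for empty stock, so the count can go negative
        self.stock -= 1

    def check(self):
        assert self.stock >= 0, f"negative stock: {self.stock}"


def robot(seed, steps=50):
    """Drive Inventory with a seeded random action sequence.

    Returns the list of actions that led to a failed assertion, or None.
    """
    rng = random.Random(seed)  # the seed makes the sequence repeatable
    inventory = Inventory()
    history = []
    for _ in range(steps):
        action = rng.choice(["add", "remove"])
        history.append(action)
        getattr(inventory, action)()
        try:
            inventory.check()
        except AssertionError:
            return history
    return None


def find_failing_seed(max_seeds=1000):
    """Try seeds in order and return the first (seed, history) that fails."""
    for seed in range(max_seeds):
        history = robot(seed)
        if history is not None:
            return seed, history
    return None
```

Once `find_failing_seed()` reports a seed, calling `robot(seed)` again reproduces the same action sequence, which turns a sporadic failure into a deterministic one.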

+2

SteelBytes Mar 25 '10 at 13:39

Tons of logging and a thorough review of the code are your only options.

This can be especially painful if the application is deployed and you cannot adjust the logging. At that point, your only choice is to go through the code with a fine-tooth comb and try to reason about how the program could get into the bad state (scientific method to the rescue!)

+2

drewh Mar 25

Often such errors are caused by corrupted memory, and for that reason they may not show up very often. You should try running your software under some kind of memory checker, such as valgrind, to see whether something goes wrong.

+2

Tuomas Pelkonen Mar 25

Along with a lot of patience, a quiet prayer and some cursing, you will need:

A good mechanism for logging user actions.
A good mechanism for capturing the data state when the user performs some action (state in the application, database, etc.).
Check the server environment (for example, antivirus software running at a specific time, etc.), record the times at which the error occurs, and see if you can find any trends.
A few more prayers and curses ...

HTH.

+2

Sunny Mar 25 '10 at 14:13

Assuming I'm starting with a production application.

1) I usually add debug logging around the areas where I think the error occurs. I set up the logging statements to give me insight into the state of the application. Then I turn on the debug log level and ask the users/operators to notify me the next time the error occurs. I then analyze the log to see what hints it gives about the state of the application and whether that leads to a better understanding of what could be going wrong.
2) I repeat step 1 until I have a good idea of where to start debugging the code in a debugger.
3) Sometimes the number of iterations of the code being executed is key, but other times it may be the interaction of a component with an external system (database, specific user machine, operating system, etc.). Take some time to set up a debugging environment that matches the production environment as closely as possible. VM technology is a good tool for this.
4) Next, I step through with the debugger. This may involve creating a test harness that puts the code/components into the state I've observed from the logs. Knowing how to set conditional breakpoints can save a lot of time, so get familiar with that and other features of your debugger.
5) Debug, debug, debug. If you are getting nowhere after a few hours, take a break and work on something unrelated for a while. Come back with a fresh mind and perspective.
6) If you still haven't gotten anywhere, go back to step 1 and do another iteration.
7) For really difficult problems, you may have to resort to installing a debugger on the system where the error occurs. This, combined with the test harness from step 4, can usually crack the really baffling issues.

+2

kragan Mar 25 '10 at 14:16

Unit tests. Testing for errors in a whole application is often dreadful because there is so much noise and so many variable factors. In general, the bigger the (hay)stack, the harder it is to pinpoint the problem. Creatively extending your unit test framework to cover edge cases can save hours or even days of sifting.

Having said that, there is no silver bullet. I feel your pain.

+2

plodder Mar 25 '10 at 16:35

Add pre-condition and post-condition checks to the methods related to this error.

You can take a look at Design by Contract.
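
A small sketch of what such checks might look like in Python. The `contract` decorator and the `withdraw` example are hypothetical and are not part of any Design by Contract library. The idea is that a violated condition fails loudly at the point where the bad state first appears, rather than several calls later.

```python
import functools


def contract(pre=None, post=None):
    """Attach an optional precondition and postcondition to a function."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise AssertionError(
                    f"precondition of {func.__name__} violated: {args!r} {kwargs!r}"
                )
            result = func(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise AssertionError(
                    f"postcondition of {func.__name__} violated: {result!r}"
                )
            return result

        return wrapper

    return decorate


@contract(
    pre=lambda balance, amount: amount > 0 and balance >= amount,
    post=lambda result, balance, amount: result == balance - amount and result >= 0,
)
def withdraw(balance, amount):
    # Hypothetical method suspected of being involved in the error
    return balance - amount
```

Here `withdraw(10, 3)` returns 7, while `withdraw(5, 8)` raises an `AssertionError` naming the violated precondition and the offending arguments.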

+2

Pierre-Jean Coudert Mar 25 '10 at 16:50

Assuming you are on Windows and your "error" is a crash or some kind of corruption in unmanaged code (C/C++), take a look at Microsoft's Application Verifier. The tool has several stops that can be enabled to verify things at runtime. If you have an idea of the scenario in which your error occurs, try running through that scenario (or a stress version of it) with AppVerifier running. Be sure to enable pageheap in AppVerifier, or consider compiling your code with the /RTCcsu switch (see http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx for more information).

+2

nithins Mar 26 '10 at 0:48

Read the stack trace carefully and try to guess what could have happened; then try to trace/log every line of code that could potentially cause the problem.

Keep your focus on resource management; many sneaky sporadic errors I have found were related to closing/disposing things :).

+1

systempuntoout Mar 25 '10 at 13:37

"Heisenbugs" require great diagnostic skills, and if you want help from the people here, you need to describe the problem in much more detail, patiently go through the various tests and checks suggested, report the results here, and repeat until you solve it (or decide it is too expensive in terms of resources).

You will probably have to tell us about your actual situation: language, database, operating system, workload estimates, the time of day at which this has happened in the past, and a myriad of other things. Tell us about the tests you have already done and how they went, and be prepared to do more and share the results.

And even this does not guarantee that we can collectively find it ...

+1

p.marino Mar 25 '10 at 13:38

I would suggest recording everything the user did. Once you have, say, 10 such error reports, you can try to find something that connects them.
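
As a rough sketch of "finding something that connects them", the Python snippet below counts how many reports each recorded fact appears in. The function name `rank_factors` and the sample facts are invented for illustration. Real reports would come from whatever action log the application keeps.

```python
from collections import Counter


def rank_factors(reports):
    """Rank recorded facts by the number of error reports they appear in."""
    counts = Counter(fact for report in reports for fact in set(report))
    # Most frequent first; ties are broken alphabetically for a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical reports: each lists the facts recorded before one failure
reports = [
    ["os=win7", "used_hotkey", "opened_report", "after_18:00"],
    ["os=xp", "used_hotkey", "opened_report"],
    ["os=win7", "used_hotkey", "exported_pdf", "opened_report"],
]
```

With this sample data, `opened_report` and `used_hotkey` appear in all three reports and rise to the top, which makes them the first candidates to investigate.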

+1

Tomek Tarczynski Mar 25

For .NET projects, you can use ELMAH (Error Logging Modules and Handlers) to monitor your application for unhandled exceptions. It is very easy to install and provides a very nice interface for browsing unknown errors.

http://code.google.com/p/elmah/

It saved me just today by catching a very random error that occurred during a registration process.

Other than that, I can only recommend getting as much information as possible from your users and having a thorough understanding of the project workflow.

They mostly come out at night ... mostly

+1

Nick Allen Mar 25 '10 at 14:29

The team I work with has enlisted users to record the time they spend in our application with CamStudio when we have a nasty bug to track down. It is easy to install and use, and makes reproducing these bugs much easier, since you can watch what the users are doing. It is also independent of the language you work in, since it just records the Windows desktop.

However, this route seems viable only if you are developing enterprise applications and have a good relationship with your users.

+1

Kevin Brill Mar 25 '10 at 22:50

It varies (as you say), but some of the things that can be useful here are

dropping into the debugger immediately when the problem occurs and dumping all threads (or the equivalent, such as dumping core immediately or whatever).
running with logging enabled but otherwise fully in release/production mode. (This is possible in some environments such as C and Rails, but not many others.)
doing things to make the edge conditions on the machine worse ... force low memory / high load / more threads / serving more requests
making sure that you are actually listening to what the users experiencing the problem are really saying. Make sure they actually explain the relevant details. This seems to be the one that trips people up in the field a lot. Trying to reproduce the wrong problem is boring.
getting used to reading assembly produced by optimizing compilers. This seems to stop people sometimes, and it does not apply to all languages/platforms, but it can help.
being prepared to accept that it is your (the developer's) fault. Do not fall into the trap of insisting that the code is perfect.
sometimes you need to actually track the problem down on the machine where it is happening.

+1

corprew Mar 25 '10 at 23:16

@p.marino - not enough rep to comment =/

tl;dr - build failures due to time of day

You mentioned time of day, and it caught my attention. We once had a case where someone stayed late at work one night, tried to build and commit before leaving, and kept getting a failure. They eventually gave up and went home. When they got in the next morning it built fine and they committed (they probably should have been more suspicious =]), and the build worked for everyone. A week or two later someone else stayed late and got an unexpected build failure. It turned out there was an error in the code that broke any build after 7 PM >.>

We also found an error in one rarely used corner of the project this January, which caused problems with sorting between different schemes, because we had not accounted for different calendars using 0-based versus 1-based months. So if nobody had touched that part of the project, we would not have found the error until Jan. 2011.

These were easier to fix than threading problems, but still interesting, I think.

+1

Windle Mar 26 '10 at 2:41

hire some testers!

+1

yamspog Mar 27 '10 at 18:44

This has worked for really weird heisenbugs. (I would also recommend getting a copy of "Debugging" by Dave Agans; these ideas are partly derived from his!)

(0) Check the system RAM using something like Memtest86!

It is the whole system that exhibits the problem, so build a test jig that exercises all of it. Say it is a server-side thing with a GUI: you run the whole thing with a GUI test harness that supplies the input needed to provoke the problem.

It will not fail 100% of the time, so you have to make it fail more often.

Start by cutting the system in half (binary chop). In the worst case you have to remove subsystems one at a time. Stub them out if they cannot be commented out.

See whether it still fails. Does it fail more often?

Keep proper test records and change only one variable at a time!

In the worst case, you use the jig and test for weeks to get meaningful statistics. It is HARD; but remember, the jig is doing the work.
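
To get a feel for what "meaningful statistics" means for a bug that fails about once in 100 runs, here is a back-of-the-envelope Python sketch. It assumes runs are independent and that the failure rate stays constant, which may not hold for a real heisenbug. The function name `clean_runs_needed` is made up.

```python
import math


def clean_runs_needed(failure_rate, confidence=0.95):
    """Consecutive passing runs needed before "it no longer fails" is believable.

    If the bug is still present and fails independently with probability
    `failure_rate` per run, the chance of n clean runs in a row is
    (1 - failure_rate) ** n. We want that chance to drop below 1 - confidence.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))
```

For a 1-in-100 failure, `clean_runs_needed(0.01)` comes to 299 clean runs for 95% confidence, which is why a jig that runs unattended matters so much.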

I have no threads and only one process and I am not talking to equipment

If the system has no threads, no communicating processes and touches no hardware, it is tricky; heisenbugs are generally synchronization problems, but in the no-threads, no-processes case it is more likely to be uninitialized data, or data used after being freed, either on the heap or on the stack. Try a checker like valgrind.

For problems with multi-threaded / multi-process systems:

Try running it on a different number of CPUs. If it runs on 1, try 4! Try forcing a 4-CPU system down to 1. That basically ensures things happen one at a time.

If there are threads or communicating processes, this can shake out bugs.

If that does not help but you suspect synchronization or threading, try changing the OS time-slice size. Make it as fine as your OS vendor allows! Sometimes this has made race conditions happen almost every time!

Conversely, try making the time slices coarser.

Then you set the test jig running with debugger(s) attached everywhere and wait for the test jig to stop on a fault.

If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.

+1

Tim Williscroft Apr 09 '10 at 4:13

Debugging is complex and time consuming, especially if you cannot deterministically reproduce the problem. My advice to you is to figure out the steps to reproduce it deterministically (and not just sometimes).

A lot of research has been done on failure reproduction in recent years, and the field is still very active. Record-and-replay techniques have (so far) been the focus of most researchers. This is what you need to do:

1) Analyze the source code and determine the sources of non-determinism in the application, that is, the aspects that can take your application down different execution paths (for example, user input, OS signals)

2) Log them the next time you run the application

3) When your application fails again, you have the steps to reproduce the failure in your log.
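
The three steps above can be sketched in Python as follows. This is a simplified illustration, not a research-grade record-and-replay system: `Recorder`, `Replayer` and `process_order` are hypothetical names, and the random calls stand in for real non-deterministic inputs such as user input or OS signals.

```python
import json
import random


class Recorder:
    """Wraps sources of non-determinism and logs every value they produce."""

    def __init__(self):
        self.events = []

    def capture(self, name, produce):
        value = produce()
        self.events.append([name, value])
        return value

    def save(self):
        return json.dumps(self.events)


class Replayer:
    """Feeds back recorded values instead of calling the real source."""

    def __init__(self, saved):
        self._events = iter(json.loads(saved))

    def capture(self, name, produce):
        recorded_name, value = next(self._events)
        if recorded_name != name:
            raise RuntimeError(
                f"replay diverged: expected {recorded_name!r}, got {name!r}"
            )
        return value


def process_order(source):
    """Hypothetical application logic whose path depends on outside input."""
    quantity = source.capture("quantity", lambda: random.randint(0, 5))
    discount = source.capture("discount", lambda: random.choice([0, 10, 100]))
    return quantity * (100 - discount)
```

Running `process_order(Recorder())` in production and saving the events lets you later call `process_order(Replayer(saved))` and get the same execution path every time, as long as every non-deterministic input goes through `capture`.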

If your log still does not reproduce the failure, then you are dealing with a concurrency error. In this case, you should look at how your application accesses shared variables. Do not try to record every access to shared variables, because you would be logging too much data, causing severe slowdowns and large logs. Unfortunately, there is not much I can say that will help you reproduce concurrency errors, because research still has a long way to go on this subject. The best I can do is provide a link to the latest advance (so far) on deterministic replay of concurrency errors:

http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html

Regards

+1

João Matos Sep 15 '15 at 16:14

Use an advanced crash reporter. In the Delphi environment, we have EurekaLog and MadExcept. Other tools exist in other environments. Or you can diagnose a core dump. You are looking for a stack trace that shows you where it blew up, how it got there, what is in memory, etc. It is also useful to have a screenshot of the application if it is a user-interaction thing, and information about the machine it crashed on (OS version and patches, what else was running at the time, etc.). Both of the tools I mentioned can do this.

If it is something that happens for a few users but you cannot reproduce it and they can, sit with them and watch. If it is not apparent, switch seats: you "drive" and they tell you what to do. You will uncover subtle usability issues that way: double-clicks on a single-click button, for example, triggering re-entrancy in the OnClick event. That sort of thing. If the users are remote, use WebEx, Wink, etc. to record them making it fail, so that you can analyze the playback.

0

Chris Thornton Apr 15

krosenvold · Accepted Answer · 2010-03-25 13:47

Analyze the problem in a pair and pair-read the code. Make notes of the things you KNOW to be true, and try to assert which logical preconditions must hold for this to happen. Follow the evidence like a CSI.

Most people instinctively say “add more logging,” and this may be the solution. But for many problems, this only makes things worse, since logging can change timing dependencies enough to make the problem more or less frequent. Changing the frequency from 1 in 1000 to 1 in 1,000,000 will not bring you closer to the true source of the problem.

So, if your logical reasoning does not solve the problem, it will probably give you a few specifics that you could investigate with logging or assertions in your code.
