What do you do if you cannot solve the problem?

Question

Have you ever had a bug in your code that you just could not solve? I hope I'm not the only one who has had this experience ...

There are some classes of bugs that are very difficult to track down:

timing-related bugs (e.g., occurring during interprocess communication)
memory-related bugs (most of you know suitable examples, I think!)
event-related bugs (hard to debug, since every breakpoint you hit makes your IDE the target of mouse / focus events)
OS-specific bugs
hardware-dependent bugs (occur on the release machine, but not on the developer machine)
...

Honestly, from time to time I fail to fix such a bug on my own ... After debugging for several hours (or sometimes even days), I feel very demoralized.

What do you do in this situation (apart from asking others for help, which is not always possible)?

Do you:

use pencil and paper instead of a debugger
turn to something else and come back to the bug later
...

Please let me know!

+31

debugging

Thomas Koschel Sep 30 '08 at 20:26

32 answers

I once worked for a company that sold a client-server application, which was mainly a means of transferring and synchronizing files. Both the client and server were custom applications that we developed.

We had a persistent bug that was very difficult to reproduce in the lab. Our server could handle a certain number of incoming client connections on each machine, so many of our customers would "cluster" several servers together to serve large user populations. The end-user data for the cluster lived on a file server that they all shared. In this cluster configuration, a bug would occur under load where we got a low-level file system error code from a file-sharing call on one of the files on the back-end file server. No one could reproduce it reliably in the lab, and even when they could, they could not narrow down what was happening.

(I forget the exact error; maybe it was 59 ERROR_UNEXP_NET_ERR or maybe 65 ERROR_NETWORK_ACCESS_DENIED. As far as I remember, it was not even one of the documented error codes that you were supposed to get from the API we were calling, which was usually a lock or unlock on a section of a file.)

Since it involved the connection between the server and the file storage, and I was the "network transport" guy, I was asked to take a look at it. Several others had already looked at it with no luck.

The one thing I had going for me was that I knew where in the code the error was detected, but not what to do about it. So I needed to find the root cause. I therefore built a suitable hardware environment to reproduce it, and deployed a custom build of the server software that instrumented the section of code in question.

The instrumentation was as follows: I added a test for the offending error code and had it call a piece of code that sent a UDP packet to a predefined network address when the error occurred. The UDP packet contained a unique string.
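A minimal sketch of that kind of instrumentation, written in Python purely for illustration (the answer does not say what language the server used). The collector address, the marker string and the function name are all hypothetical; only the two error codes come from the answer above.

```python
import socket

# Hypothetical values: pick an address your sniffer can see and a string
# that cannot occur in normal traffic.
BEACON_ADDR = ("127.0.0.1", 9999)
BEACON_TEXT = b"BUG-BEACON"
SUSPECT_CODES = {59, 65}  # the two codes recalled in the answer above


def send_beacon(error_code, addr=BEACON_ADDR):
    """Send one UDP datagram if error_code is a suspect; return True if sent."""
    if error_code not in SUSPECT_CODES:
        return False
    payload = BEACON_TEXT + b" code=" + str(error_code).encode("ascii")
    # UDP is fire-and-forget, so the instrumented code path is barely slowed.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return True
```

Calling send_beacon(code) right where the error code is detected gives a packet sniffer a marker to trigger on, without writing to disk or stopping in a debugger.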

Then I installed a packet sniffer on the network. (At that time I used Microsoft Network Monitor.) I positioned it where it could "see" this UDP packet when it was sent, as well as all the traffic between the cluster servers and the file server.

Most good sniffers have a mode in which they capture until they see a certain piece of traffic, and then stop. I turned on this mode and set it to look for the UDP packet that my code would send. The goal was to end up with a packet capture of all the file server traffic right before the error occurred. The most recent network packets to and from the system that sent the UDP packet were likely to be a big clue to what was happening.

I set up the stress test configuration and went home for the weekend.

When I came back on Monday, there was my data. The sniffer had stopped as expected, after many hours of running, and had kept the capture. Studying the capture, I found that the Server Message Block or SMB (aka CIFS aka SAMBA) connection between our server and the file server was actually failing at the TCP level due to excessive load on the server. Because the Microsoft stack is heavily layered, this bubbled back up through the file-sharing stack as an "unexpected" error instead of a more understandable error code that said "hey, you lost the connection at the TCP level".

I did a little research on the TCP settings for Windows and found that the defaults for the version of Windows we were using (probably NT 4 in that era) were not very generous. They allowed only a very small number of failures on a TCP connection and then, boom, you were dead. Once you lost your connection to the SMB file server, all of your file locks were toast and could not be recovered.

So I ended up writing an appendix to the user guide that explained how to change the TCP settings in Windows to make your cluster servers more tolerant of high-load situations. And that was all. The bug fix was a zero-line code change: just additional documentation on how to configure the OS properly for use with this product.

What did we learn?

Be prepared to run modified versions of your code to investigate the problem.
Consider using non-traditional tools to solve the problem (sniffers).
Not all bug fixes require code changes.
Sometimes you can diagnose a bug while you are at home having a beer.

+16

Tim Farley Sep 30 '08 at 21:01

I do a few different things:

throw away all my assumptions and start from scratch. Remember that the bug exists because something that looks right is actually wrong. Even the lines or functions you are absolutely sure are correct may be wrong. Until you can convince yourself that something is correct, you cannot assume that it is.
keep adding print statements and assert statements to rule things out and let me form new assumptions.
step through the code in the debugger if the problem is related to control flow. Do not step over functions; step into them and review all the details of their implementation to confirm that they work correctly. Confirm the arguments and return values.
if a line, function or class is suspicious but I cannot prove it in place, write a short test case that does what I think the problem is. This may find the problem or give an idea of where to look next.
stop for the day. It's amazing how much offline processing your brain does overnight. Often the answer or a key insight appears the next day while I am doing something mindless, like showering or driving.

+11

mxg Sep 30 '08 at 20:41

Create an automated way to trigger the bug. The worst kind of bug to fix is one that takes a long time to reproduce.
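One possible shape for such an automated trigger, as a Python sketch. It assumes the flaky code can be driven by a seeded random generator; flaky_operation is a made-up stand-in for the real code under test, and find_failing_seed is a hypothetical helper name.

```python
import random


def flaky_operation(rng):
    """Hypothetical stand-in for the code under test: fails on rare inputs."""
    values = [rng.randint(0, 999) for _ in range(5)]
    if sum(values) % 97 == 0:
        raise RuntimeError(f"intermittent failure for {values}")


def find_failing_seed(operation, max_attempts=10_000):
    """Run operation with seeds 0, 1, 2, ... and return the first seed that fails.

    Returns None if no attempt fails within max_attempts.
    """
    for seed in range(max_attempts):
        try:
            operation(random.Random(seed))
        except Exception:
            return seed
    return None
```

Once a failing seed is known, operation(random.Random(seed)) replays the same failure on demand, which turns an intermittent bug into a deterministic one for as long as the inputs are the only source of randomness.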

+9

Robert Sep 30 '08 at 20:35

A quote taken from Cryptonomicon:

"Intuition, like a flash of lightning, lasts only for a second. It generally comes when one is tormented by a difficult decipherment and when one reviews in his mind the fruitless experiments already tried. Suddenly the light breaks through and one finds after a few minutes what previous days of labor were unable to reveal."

+6

arul Sep 30 '08 at 20:34

I usually ask someone to take a look at the code. While I explain what the code is supposed to do, I sometimes spot the bug as I talk.

When the bug is a hard one, I sit and work at it until I figure it out and solve the problem. Interestingly, there are times when hunting down a mysterious bug is more enjoyable than anything that just works smoothly. And the relief and the feeling when the bug is fixed, well, not many other things can beat that (except for the obvious).

+6

petr k. Sep 30 '08 at 20:37

If all else fails, do not solve it directly. Rewrite the code for the problem area in a more refactored way.

+3

Brian R. Bondy Sep 30 '08 at 20:34

I have definitely had bugs that I worked on for 4-5 days before finding a solution. Other bugs have sat in the bug tracker for several months while I chipped away at them a few hours at a time over a long period. I think that kind of bug is inevitable in any complex software project.

Some things that work well for me:

binary search through the program flow using logging
use Trace statements with DbgView to find bugs that appear only in release mode
find an alternative way to reproduce the bug without changing the code
(counterintuitive, but ...) change the code so that the bug is easier to reproduce (the failure condition is reached more easily)
sleep on it and try again tomorrow with a fresh pair of eyes :)

The worst bug, in my opinion, is the concurrency bug that disappears as soon as logging is added.

+3

Nick Sep 30 '08 at 20:48

There are many great answers here. One thing that has worked for me in the past is to ask "what can I do to make it completely obvious when this problem occurs?".

For example, if the problem is a damaged value in the data structure, try creating a consistency check procedure that you can run periodically. Also consider implementing all access to shared data through a set of functions that record every change.
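A small Python sketch of both ideas together, under the assumption that the shared data fits in a dictionary and that "consistent" can be expressed as a predicate. AuditedStore and its method names are made up for the example.

```python
class AuditedStore:
    """Dictionary wrapper that records every change and checks an invariant."""

    def __init__(self, invariant):
        self._data = {}
        self._invariant = invariant  # callable: dict -> bool
        self.history = []            # (key, old_value, new_value) tuples

    def set(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        self.history.append((key, old, value))
        self.check()  # fail at the moment of corruption, not hours later

    def get(self, key):
        return self._data[key]

    def check(self):
        """Consistency check; can also be called periodically from outside."""
        if not self._invariant(self._data):
            last = self.history[-1] if self.history else None
            raise AssertionError(f"invariant broken; last change: {last}")
```

Because every write goes through set(), the history shows which change broke the invariant, and the failure is raised right at that change.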

Or, if the problem is a "random" memory overwrite, use a replacement malloc() / free() that catches writes to freed memory (for example, Electric Fence or dmalloc).

Someone else mentioned automating the process of triggering the bug. If you can do that, great. Even a routine that exercises the program randomly can help in these cases.

+2

Mark Bessey Sep 30 '08 at 20:47

"What do you do in this situation (apart from asking others for help, which is not always possible)?"

When is it not possible to ask for help?

There are always others you can turn to for help: your colleagues, your boss, friends, here on Stack Overflow, etc.

Recognizing when to seek help should not be demoralizing!

+2

Joe Strazzere Sep 30 '08 at 21:01

There are a lot of good tips here.

The one I absolutely disagree with is the idea of changing the code in the hope that the bug will go away. First, you are probably going to introduce new bugs. Second, you can easily change things just enough to hide the bug, only for it to reappear with the next patch.

Memory corruption bugs are especially bad about this: they disappear as magically as they appear. The corruption has not been fixed, though; it is merely hitting non-fatal areas of memory now.

1) Try using a different debugger. For example, I use WinDbg more and more. When you load the program into a debugger, the memory layout of your application changes a little. Perhaps another debugger will make the bug show up in a slightly different way.

2) If you do resort to changing the code without knowing exactly what the problem is, and the bug disappears, YOU MUST go back and understand why the change fixed it. Otherwise, you are probably just hiding the bug.

3) Talk to others about the bug; maybe they have seen different variants of the same problem (i.e. other ways to reproduce it).

4) Logging.

+2

Torlack Sep 30 '08 at 21:03

I have had bugs that took weeks or months before a solution was found, but ultimately all bugs can be fixed. Besides the classic methods for tracking down bugs without a debugger, such as disabling parts of the system until you get a minimal test case, I have used the following methods:

Look for better debugging tools. A new perspective goes a long way. Xdebug is something I started using in PHP purely because of a performance bug I was making no progress on.
Study the technology in which the bug lives. This helped when debugging an Outlook add-in. It had random errors that made no sense and that Google searches did not explain. By studying best practices for Outlook add-in, COM and MAPI programming, we got a clearer idea of what might be going wrong and thought of new things to try, which eventually fixed the bugs.
Try to make the problem worse. If a problem occurs only occasionally, I try to find ways to make it happen all the time. This helped to pin down bugs in web applications under IE, as well as to narrow down a severe bug in the Flash plugin.
When all else failed, I have rewritten the subsystem that caused the problems from scratch. This may take several days or even weeks, but if you are stuck on a bug and cannot solve it, and your customers will not take no for an answer, what else can you do? It does not always fix things, but when it does not, you usually get a clearer picture of what is going wrong.

I have noticed a few things in common among the bugs I was stuck on for weeks:

It rarely helps to ask outsiders for help, and as a rule you should not wait for someone else to come and save the day.
Almost always, the bug is in some third-party technology, especially when you use its obscure details. IE had nasty bugs when trying to use client certificates. Flash could not cope with randomly generated drawing instructions (some of which were meaningless). Outlook does not like it when you try to change the layout of a form dynamically from code. These days, I have learned to respect the "comfort zones" of proprietary technology.

+2

Joeri Sebrechts Sep 30 '08 at 21:09

I give it more time. I once had a bug (in a personal project) that I simply could not figure out. I tried every debugging method I could think of, including Google, without success. Six months later, I came back and found the bug within an hour or so. It was not something simple (something apparently undocumented was going on deep inside Swing), but I simply looked at it in a way I had not before.

+2

Michael Myers Sep 30 '08 at 21:23

I have had this problem before, and I believe everyone has. I had a bug that simply could not be found, yet the program crashed all the time. What I do when I hit a bug like that is sit down and concentrate on every piece of code until I find it. It is difficult and it takes patience, but that is all that can be done in such a situation.

Hope this helps.

+1

Rayne Sep 30 '08 at 20:30

I honestly can't remember a bug that I couldn't fix. It may require a lot of refactoring or it may take some time, but I have never had one that I could not get rid of. If it takes me more than an hour to track one down, then it is almost always something really stupid and small, like typing a : where there should have been a ; etc.

In Python, if I am using an editor that is not mine, or it is another user's code, I use retab! in vim or paste the code into something like a paste site to check the indentation (if I don't have vim).

If it is not a crasher / deal breaker, then I move on and come back with a fresh pair of eyes.

Oh, and you can never have too much logging.

+1

camflan Sep 30 '08 at 20:31

I add as much debug output as possible (writing to a log file, message boxes, etc.) and check it.

I do not think that is the worst kind of bug you can find, though. The worst are the ones you cannot reproduce deterministically or in a test environment.

+1

Gabriele D'Antona Sep 30 '08 at 20:34

I also get a little demoralized when I can't solve a problem. Usually, when I hit a wall with a bug, I just note down my findings and stop working on it. I jump to another bug that is easier to solve, and then come back to the first one. That way I have a fresh mind and a fresh attitude towards solving the problem. Sometimes you tend to overcomplicate things when you spend too much time on one bug. Taking a break helps break through the wall.

Rwendi

+1

RWendi Sep 30 '08 at 20:35

Firstly, is it reproducible? That is a HUGE plus if it is. I like bugs that always or never happen ... it is the intermittent ones that are difficult.

It depends on the problem, but in my shop we will usually tag-team such a problem, on the theory that 2 heads (or 3 or 4) are better than 1.

Sometimes the bug is not even in MY code, though usually it is. There have been problems where a third-party library was the culprit, or a specific implementation on a particular platform was the cause. Those stink.

I will use anything and everything to at least track it down: debuggers, trace output, whatever.

As a rule, if I can isolate it to a class or module, I will write a test harness that mimics the real world and try to reproduce it. I usually write my own test code, but sometimes there is legacy code (or another developer's code) that has no tests.

Usually I will talk through the design and the problem, out loud, with the team at a whiteboard or something similar. Often a solution surfaces when we talk about it as a group.

That is what I do.

+1

itsmatt Sep 30 '08 at 20:37

I usually try to solve it. But if that is not possible within a reasonable amount of time, I leave it for a while so that I can solve the problem in my sleep ;) Sometimes it works ...

+1

f13o Sep 30 '08 at 20:37

I would consider asking for help on this website called /qaru.site / ... , which I have been visiting recently ...

+1

Adam Bellaire Sep 30 '08 at 20:45

Seriously? I do things in this order:

Sleep on it
Ask a colleague
Rewrite so that the area is not affected
Ask on SO
Open a support ticket with the vendor of your third-party library

+1

Johnno Nolan Sep 30 '08 at 20:49

This is what I did today ...

I am debugging a HW / SW interaction, and frequently logging (instrumentation) changes or hides the bug. So tests have to run "at speed". I call these bugs "cockroaches" because they run away from any light I try to shine on them.

Therefore, I have to:

Find the transaction that causes the error. Enumerate the HW interaction through a log (that test passes, but it illustrates the flow).

Instrument before and after the error to print state changes.

The bug I am working on now is, of course, the worst kind of all, since the HW locks up. The HW includes the processor, so it is like being in a well-lit room when the power fails and everything goes pitch black.

I have a special backdoor view into memory, but of course that is locked up too. I tried power cycling in the hope that the memory would retain its contents long enough to use the backdoor again. No such luck.

, , ( , ..). HW, , HW.

, , - .

, , ...

HW SW - , , , , . . ? (, , HW). Nth? Nth (N-1) -. SW , . , .

SW , ? HW . ASIC. , ISA .

, , . .

, , SW HW . , , , HW . . N- , (, ).

, , , . , ;)

, , . () - . .

, . - , . , .

, , , - 1000.

HW , .

, - , .

+1

humble_guru Oct 01 '08 at 0:19

, , , Ants Profiler .

0

Mitchel Sellers Sep 30 '08 at 20:30

.

, .

.

0

Andreas Petersson Sep 30 '08 at 20:31

, , "" , - .

, / , , - , , , .

, , " ", , , - , .

0

Ron Savage Sep 30 '08 at 20:32

, - , .

, , , , , - .

, , , , XP Pro, IIS 5.0. , , , .

" O/S", -, IE Firefox , Safari Mac. , CSS, Mac, , , , ? , Linux - Linux-, ?

, , .

0

JB King Sep 30 '08 at 20:35

. , , , . , - , , , .

, , , !

0

Kluge Sep 30 '08 at 20:35

, . 3 , , . , , , , . . QA, . Over time i

, , ,
, , , .
stdout, , , < kill -3 "
..
, , .

, , , .

0

Paul Tomblin Sep 30 '08 at 20:41

, , !

. / , . ( - ) !

0

elmarco Sep 30 '08 at 20:43

, . , - ( -, ), , . , , , , !

0

Steve Sep 30 '08 at 20:46

Matias Nino · Accepted Answer · 2008-09-30 20:30

Some things that help:

1) Take a break and approach the bug from a different angle.

2) Get more aggressive with tracing and logging.

3) Get another pair of eyes to look at it.

4) A usual last resort is to figure out a way to make the bug irrelevant by changing the underlying conditions in which it occurs.

5) Smash and break things. (Just to relieve stress!)
