Different methodologies for eliminating errors that occur only in production

Question

I am a fairly new programmer and relatively unfamiliar with support and bug-fixing work. Until now I had never encountered an error that occurs only in the WebSphere environment and not in my localhost test environment. When I first received this bug report, I was confused about why I could not reproduce it locally. I tried the WebSphere test environment to see what would happen, and there I reproduced the error. The problem is that I cannot make changes to, or deploy builds to, the WebSphere test environment; I can only change my local environment. Given this constraint, what methodologies exist for resolving such errors? Are there any at all? Any tips or help on resolving such issues would be appreciated.

+6

debugging testing websphere production

faceless1_14 Jul 27 '09 at 16:24

3 answers

In short, the methodology is to isolate and understand the differences between environments and which ones can cause a problem.

Check your local build. Make sure that you are running the same version (code and database) as Test and Prod. If you are, what environmental differences could affect the problem you are seeing (multi-core, load balancing, operating system version, third-party library versions)? Do not run locally in the debugger; run the release build (if that is what is on Test and Prod), and perhaps even do a local deployment rather than running from source.
Check whether specific data is causing the problem. If you can, restore a copy of the Test database locally and see whether you can reproduce the problem.
Talk to other developers. See if they can reproduce the problem in their environments. Consult the QA team and think through what could cause such a problem (they have often seen “similar” problems and may give you a clue).
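
One way to make the environment comparison concrete is to dump the same set of runtime facts on each machine and diff the results. The following is a minimal sketch, not a complete inventory: the class name `EnvSnapshot` and the chosen property keys are assumptions for illustration, and it uses Java 8 syntax. Run `capture()` locally and on the test server (for example from a diagnostic page or a startup log line), then compare the two maps.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

public class EnvSnapshot {
    // Hypothetical selection of properties; extend it with whatever matters for your application.
    private static final String[] KEYS = {
        "java.version", "java.vendor", "os.name", "os.version", "os.arch",
        "file.encoding", "user.timezone"
    };

    /** Captures a sorted snapshot of selected JVM properties plus the core count. */
    public static Map<String, String> capture() {
        Map<String, String> snapshot = new TreeMap<>();
        for (String key : KEYS) {
            // String.valueOf turns a missing property into the text "null".
            snapshot.put(key, String.valueOf(System.getProperty(key)));
        }
        snapshot.put("availableProcessors",
                String.valueOf(Runtime.getRuntime().availableProcessors()));
        return snapshot;
    }

    /** Returns only the keys whose values differ, formatted as "local -> other". */
    public static Map<String, String> diff(Map<String, String> local, Map<String, String> other) {
        Map<String, String> differences = new TreeMap<>();
        TreeSet<String> allKeys = new TreeSet<>(local.keySet());
        allKeys.addAll(other.keySet());
        for (String key : allKeys) {
            String a = local.get(key);
            String b = other.get(key);
            if (!Objects.equals(a, b)) {
                differences.put(key, a + " -> " + b);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Print key=value lines that can be saved on each machine and compared.
        capture().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the output sorted means a plain text diff of two saved snapshots also works when you cannot load both maps into one process.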

At this point, if all else fails, I generally go into a deep state of Zen to try to understand what is missing. There must be a difference; you just need to find it.

+3

JP Alioto Jul 27 '09 at 16:37

The scientific method always applies: first check your assumptions. If the systems differ, the problem may lie in an implicit default that is set differently, or in a different implementation of some function.

In all debugging, localization is key. You must first isolate the problem area. If your OS, patches, libraries, and main software are identical, then the cause is probably system settings (limits on sockets, file descriptors, and so on). If you know that you have enough inodes, free disk space, and memory, then it is not a resource problem. If the machine barely responds to your interactive commands, the load is too high or you have some runaway processes. Keep in mind what each process needs in order to run, and make sure it gets it.

It may be code that simply cannot cope with the load of the production system. Locking mechanisms are a very common source of problems that show up in production but not in dev/test systems, simply because you cannot create as many test cases as production gives you for free.
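
Since contention problems rarely show up under single-user local testing, it can help to drive the suspect code from many threads at once. The sketch below is one possible harness, assuming Java 8; the class name `ContentionHarness` and the method `hammer` are made up for illustration. It releases all workers at the same moment with a latch so that they actually collide.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ContentionHarness {
    /** Runs the task `iterations` times on each of `threads` workers, all started together. */
    public static void hammer(int threads, int iterations, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch startGate = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    startGate.await(); // hold every worker until the gate opens
                    for (int i = 0; i < iterations; i++) {
                        task.run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            });
        }
        startGate.countDown(); // release all workers at once
        try {
            finished.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] unsafe = {0};                       // deliberately unsynchronized
        AtomicInteger safe = new AtomicInteger(); // thread-safe counterpart
        hammer(8, 100_000, () -> unsafe[0]++);
        hammer(8, 100_000, safe::incrementAndGet);
        // The unsynchronized total varies from run to run and may fall short of the atomic one.
        System.out.println("unsynchronized: " + unsafe[0]);
        System.out.println("atomic:         " + safe.get());
    }
}
```

A harness like this does not replace production load, but pointing it at the code path you suspect can turn a once-a-week failure into one that occurs within seconds on a local machine.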

Logging is easy to overlook, but I also like to add plenty of debug output to the code to make debugging easier. I cannot even count how many times a particular environment variable, path, or broken symbolic link has ruined my day, only for me to realize that the fix would have been trivial had I looked at the values of the variables at run time and not just at the static code.
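
As an illustration of logging run-time values, the sketch below writes out environment variables and path states at startup, including whether a path is a symbolic link and whether its target exists. It assumes Java 7 or later for `java.nio.file`; the class name `StartupDiagnostics` and the variable names passed in `main` are placeholders, not anything from the original question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.logging.Logger;

public class StartupDiagnostics {
    private static final Logger LOG = Logger.getLogger(StartupDiagnostics.class.getName());

    /** Formats an environment variable, making an unset variable explicit. */
    public static String describeEnv(String name) {
        String value = System.getenv(name);
        return name + "=" + (value == null ? "(unset)" : value);
    }

    /** Formats a path with its link target (if any) and whether it resolves to something. */
    public static String describePath(Path path) {
        StringBuilder sb = new StringBuilder(path.toString());
        if (Files.isSymbolicLink(path)) {
            try {
                sb.append(" -> ").append(Files.readSymbolicLink(path));
            } catch (IOException e) {
                sb.append(" -> (unreadable link: ").append(e.getMessage()).append(")");
            }
        }
        // Files.exists follows links, so a dangling symbolic link is reported as MISSING.
        sb.append(Files.exists(path) ? " [exists]" : " [MISSING]");
        return sb.toString();
    }

    /** Logs every given variable and path at INFO level. */
    public static void logAll(String[] envNames, Path[] paths) {
        for (String name : envNames) {
            LOG.info(describeEnv(name));
        }
        for (Path path : paths) {
            LOG.info(describePath(path));
        }
    }

    public static void main(String[] args) {
        // Placeholder inputs; substitute the variables and paths your application depends on.
        logAll(new String[] {"PATH", "JAVA_HOME"},
               new Path[] {Paths.get(System.getProperty("java.io.tmpdir"))});
    }
}
```

Calling something like `logAll` once at application startup leaves a record in the server log that can be compared with the same output from a local run.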

If all else fails, ltrace and strace are the best way to really see what happens under the hood. Their output is not easy to read, but once you get used to locating and interpreting the return values of system calls, you have a very powerful debugging tool.

+1

Marcin Jul 27 '09 at 17:26

Jon Skeet · Accepted Answer · 2009-07-27T16:29:23+0000

Campaign for full access to the test environment. The ability to change settings, redeploy, and retry makes a huge difference. It is worth explaining how the lack of access severely limits your ability to do your job.
Make sure you have adequate logging, and that it is configurable. Make sure you keep the logs long enough to trace an issue a customer reported, even if it happened a few days ago.
When you finally diagnose the problem and work out why it happens only in a certain environment, write it down and try to get your local system to behave the same way. That should make it easier to diagnose another symptom of the same problem next time.
