I think I have a tricky one... I have a WinForms application that crashes fairly regularly, every hour or so, when running as an x64 process. I suspect this is due to stack corruption and would like to know if anyone has seen a similar problem or has recommendations for diagnosing and detecting it.
There is no visible interface in this program. This is just a message box that sits in the background and acts as a kind of “middleware” between our other client programs and the server.
It dies in different ways on different machines. Sometimes it is an "APPCRASH" dialog that reports a fault in ntdll.dll. Sometimes it is an "APPCRASH" dialog that names our own DLL as the culprit. Sometimes it is just a silent death. Sometimes our unhandled exception hook logs the error, sometimes it doesn't.
In the cases where Windows Error Reporting kicks in, I examined memory dumps from several different crash scenarios and each time found the same managed exception in memory. This is the same exception that I see reported as the unhandled exception in the cases where we manage to log something before the process dies.
I was also lucky (?) enough to have the application crash while I was actively debugging it with Visual Studio, and saw that the same exception was taking the program down.
Now here's the kicker. This particular exception was thrown, caught and swallowed in the first few seconds of the program's life. I verified this with additional trace logging, and I took dumps of the application's memory a couple of minutes after startup and confirmed that the exception was still sitting there on the heap. I also ran a memory profiler on the application and used it to make sure that no other .NET object was referencing it.
The code looks a bit like this (greatly simplified, but it keeps the key points of the control flow):

    public class AClass
    {
        public object FindAThing(string key)
        {
            object retVal = null;
            Collection<Place> places = GetPlaces();
            foreach (Place place in places)
            {
                try
                {
                    // Stop at the first place that has the thing.
                    retVal = place.FindThing(key);
                    break;
                }
                catch { } // The exception is swallowed; try the next place.
            }
            return retVal;
        }
    }
The stack trace that I see, both in the event log and when viewing the heap using windbg, looks something like this.
    Company.NotFoundException:
        Place.FindThing()
        AClass.FindAThing()
Now... to me, this smells a bit like stack corruption. An exception is thrown and caught when the application starts. But a pointer to it survives on the stack for an hour or more, like a bullet in the brain, and then suddenly nicks a critical artery, and the application dies in a puddle.
Additional hints:
The code inside 'InternalFetch' uses some Marshal.[Alloc/Free]CoTaskMem and P/Invoke code. I ran FxCop over it looking for portability issues and it found nothing.
This specific manifestation of the problem only affects x64 code built in release mode (with code optimization). The code I gave for the Place.Find method reflects the optimized .NET code. The unoptimized code returns the found object as its last statement, not a "throw exception".
We make some COM calls during startup before the above code runs... and in the scenario where the problem appears, the first COM call fails (the exception is caught and swallowed). I commented out this particular COM call, and that does not stop the exception from lingering on the heap.
The problem may also affect 32-bit systems, but if so, it does not show up in the same place. I was sent (typical users!) a screenshot only a few pixels in size of the APPCRASH dialog, but the one thing I could make out was StackHash_2264 in the fault module field.
EDIT:
Breakthrough!
I have narrowed the problem down to a specific call to SetTimer. The P/Invoke looks like this:
    [DllImport("user32")]
    internal static extern IntPtr SetTimer(IntPtr hwnd, IntPtr nIDEvent, int uElapse, TimerProc CB);

    internal delegate void TimerProc(IntPtr hWnd, uint nMsg, IntPtr nIDEvent, int dwTime);
There is a specific class that starts a timer in its constructor. Any timers set before this object is constructed work. Any timers set after this object is constructed work. Any timer set during this constructor causes the application to crash, more often than not. (I have a laptop that crashes in perhaps 95% of cases, but my desktop only crashes in 10% of cases.)
Whether the interval is set to 1 hour or 1 second does not seem to make a difference. The application dies when the timer comes due, usually by rethrowing some previously handled exception, as described above. The callback never actually executes. If I set the same timer on the very next line of managed code after the constructor returns, everything is fine and happy.
I had a debugger attached when a bad timer fired, and it caused an access violation in "DispatchMessage". The timer callback was never called. I enabled the MDAs associated with managed callbacks being garbage collected, and that turned up nothing. I examined the objects with sos and verified that the callback delegate still exists in memory and that the address it points to is the correct callback function.
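To make the setup concrete, here is a minimal sketch of a callback-based timer whose delegate is kept rooted, so the GC cannot collect it while Windows still holds the function pointer. `NativeTimer`, `Start` and `Stop` are hypothetical names, not our real code. The parameter types follow the documented Win32 `SetTimer` and `TimerProc` signatures (`UINT_PTR` as `UIntPtr`, `UINT` and `DWORD` as `uint`). It assumes the calling thread runs a message loop, since the callback is dispatched through it. This only illustrates the shape of the declaration, it is not a confirmed fix.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
internal static class NativeTimer
{
    [UnmanagedFunctionPointer(CallingConvention.Winapi)]
    internal delegate void TimerProc(IntPtr hWnd, uint nMsg, UIntPtr nIDEvent, uint dwTime);

    [DllImport("user32", SetLastError = true)]
    private static extern UIntPtr SetTimer(IntPtr hWnd, UIntPtr nIDEvent, uint uElapse, TimerProc lpTimerFunc);

    [DllImport("user32", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool KillTimer(IntPtr hWnd, UIntPtr uIDEvent);

    // A static field roots the delegate for as long as the timer is alive.
    private static TimerProc s_callback;
    private static UIntPtr s_timerId;

    internal static void Start(uint intervalMs, TimerProc callback)
    {
        s_callback = callback;
        // A null window handle asks Windows to allocate a new timer ID.
        s_timerId = SetTimer(IntPtr.Zero, UIntPtr.Zero, intervalMs, s_callback);
        if (s_timerId == UIntPtr.Zero)
            throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
    }

    internal static void Stop()
    {
        if (s_timerId != UIntPtr.Zero)
        {
            KillTimer(IntPtr.Zero, s_timerId);
            s_timerId = UIntPtr.Zero;
        }
        // Only release the delegate after the timer has been killed.
        s_callback = null;
    }
}
```

Usage would be something like `NativeTimer.Start(1000, (h, m, id, t) => Console.WriteLine("tick"));` on the thread that pumps messages, followed by `NativeTimer.Stop()` at shutdown.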
If I run '!analyze -v' at this point, it usually (but not always) reports something along the lines of 'ERROR_SXS_CORRUPT_ACTIVATION_STACK'.
Replacing the call to SetTimer with Microsoft's 'System.Windows.Forms.Timer' class also stops the crash. I used Reflector on the class and can see that it still calls SetTimer, but it does not register a timer procedure. Instead, it has its own window, which receives the callback. Its P/Invoke definition actually doesn't look right... it uses 'int' for the event ID, where the MSDN documentation says it should be a UINT_PTR (UIntPtr in .NET).
Our own code originally also used 'int' for nIDEvent, not IntPtr (I changed it during this investigation), but the crash occurred both before and after changing this declaration. So the only real difference I can see is that we are registering a callback and the Windows class is not.
So... at this point, I can "fix" the problem by moving one specific SetTimer call to a slightly different place. But I still do not understand what is so special about starting a timer inside this constructor that it causes this error. And I would very much like to understand the root cause of this problem.
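For comparison, the System.Windows.Forms.Timer replacement looks roughly like the sketch below. `Poller` and `OnTick` are hypothetical names standing in for our real class and callback. It assumes the object is created on a thread with a running message loop, since the Tick event is raised through that loop.

```csharp
using System;

// Hypothetical class, for illustration only.
internal sealed class Poller : IDisposable
{
    // Fully qualified to avoid clashing with System.Threading.Timer.
    private readonly System.Windows.Forms.Timer _timer = new System.Windows.Forms.Timer();

    public Poller(int intervalMs)
    {
        _timer.Interval = intervalMs;
        _timer.Tick += OnTick;
        _timer.Start();
    }

    private void OnTick(object sender, EventArgs e)
    {
        // The work that used to live in the TimerProc callback goes here.
    }

    public void Dispose()
    {
        _timer.Stop();
        _timer.Tick -= OnTick;
        _timer.Dispose();
    }
}
```

With this shape there is no delegate handed to native code by us at all, which matches the observation that the Windows class does not register a timer procedure.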