What can change the frame pointer?

Question

I have a very strange bug cropping up right now in a fairly massive C++ application (massive in terms of CPU and RAM usage as well as code length - in excess of 100,000 lines). It runs on a dual-core Sun Solaris 10 machine. The program subscribes to stock price feeds and displays them on "pages" configured by the user (a page is a window layout customized by the user - the program allows the user to configure such pages). This program used to work without issue until one of the underlying libraries became multi-threaded. The parts of the program affected by this have been changed accordingly. On to my problem.

Roughly once in every three executions, the program segfaults at startup. This is not a hard rule - sometimes it will crash three times in a row and then work five times in a row. It is the segfault that is interesting (read: painful). It can manifest itself in several ways, but most commonly function A calls function B, and upon entering function B the frame pointer is suddenly set to 0x000002. Function A:

result_type emit(typename type_trait<T_arg1>::take _A_a1) const { return emitter_type::emit(impl_, _A_a1); }

This is a simple signal implementation. impl_ and _A_a1 are well defined within their frame at the crash. On actual execution of that instruction, we end up at program counter 0x000002.

This doesn't always happen in that function. In fact it happens in quite a few places, but this is one of the simpler cases that doesn't leave much room for error. Sometimes a stack-allocated variable will suddenly be sitting on junk memory (always at 0x000002) for no reason whatsoever. Other times, that same code will run just fine. So my question is, what can mangle the stack so badly? What can actually change the value of the frame pointer? I've certainly never heard of such a thing. About the only thing I can think of is writing out of bounds on an array, but I've built it with a stack protector, which should catch any instances of that happening. I'm also well within the bounds of my stack here. I also don't see how another thread could overwrite a variable on the stack of the first thread, since each thread has its own stack (these are all pthreads). I've tried building this on a Linux machine, and while I don't get segfaults there, roughly one time in three it will freeze up on me.

c++ callstack corruption

Peter Wlodarczyk Oct 30 '08 at 10:44

14 answers

Roddy · Answer 1 · 2008-10-30T23:36:39+0000

It's stack corruption, 99.9% definitely.

The smells you should look carefully for are:

Use of 'C' arrays
Use of 'C' strcpy-style functions
memcpy
malloc and free
Thread safety of anything using pointers
Uninitialized POD variables.
Pointer Arithmetic
Functions trying to return local variables by reference

Konrad Rudolph · Answer 2 · 2008-10-30T23:02:31+0000

I had this exact problem today, and I was knee-deep in gdb, debugging for an hour, before it occurred to me that I had simply written past the bounds of a C array (where I least expected it).

So, if possible, use vectors instead, because any decent STL implementation will give good diagnostics if you try that in debug mode (whereas C arrays punish you with segfaults).
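A minimal sketch of the kind of check meant here, assuming a std::vector stands in for the C array. The helper name write_is_rejected is made up for illustration. The at() member always checks the index; with libstdc++, building with -D_GLIBCXX_DEBUG should add similar checking to operator[] as well.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: tries to write '!' at index i.
// Returns true if the bounds check rejected the write.
bool write_is_rejected(std::vector<char>& buf, std::size_t i) {
    try {
        buf.at(i) = '!';  // at() checks the index; plain operator[] does not
        return false;
    } catch (const std::out_of_range&) {
        return true;      // an out-of-bounds write becomes an exception
    }
}
```

With a ten-element vector, index 9 is accepted and index 10 is rejected, instead of silently trampling whatever sits next to the buffer.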

Michael Burr · Answer 3 · 2008-10-30T23:09:10+0000

I'm not sure what you're calling a "frame pointer", as you say:

On actual execution of that instruction, we end up at program counter 0x000002

That makes it sound like the return address is being corrupted. The frame pointer is a pointer that indicates the location on the stack of the current function call's context. It may well point to the return address (that's an implementation detail), but the frame pointer itself is not the return address.

I don't think there's enough information here to really give you a good answer, but some things that might be culprits are:

Incorrect calling convention. If you call a function using a calling convention different from the one the function was compiled with, the stack may become corrupted.
RAM hit. Any write through a bad pointer can cause garbage to end up on the stack. I'm not familiar with Solaris, but most thread implementations keep the threads in the same process address space, so any thread can access any other thread's stack. One way a thread can get a pointer into another thread's stack is if the address of a local variable is passed to an API that ends up dealing with the pointer on a different thread. Unless you synchronize things properly, this will end up with the pointer accessing invalid data. Given that you're dealing with a "simple signal implementation", it seems possible that one thread is sending a signal to another. Maybe one of the parameters in that signal has a pointer to a local?
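One way to avoid the dangling-pointer scenario described above is to give the other thread its own copy of the data rather than the address of a local. The following is only a sketch: PriceUpdate and post_update are hypothetical names, and it uses std::thread from C++11 for brevity, whereas the application in the question uses raw pthreads and gcc-4.2.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <thread>
#include <utility>

// Hypothetical payload carried by a signal.
struct PriceUpdate {
    std::string symbol;
    double price;
};

// Starts a thread that consumes an update. The lambda captures `update`
// by value, so the thread works on its own copy and never touches this
// function's stack frame after the function has returned. The result is
// written through a shared_ptr, which keeps the target alive for as long
// as the thread holds it.
std::thread post_update(std::string symbol, double price,
                        std::shared_ptr<double> sink) {
    PriceUpdate update{std::move(symbol), price};  // local to this frame
    return std::thread([update, sink] { *sink = update.price; });
}
```

Had the lambda captured `update` by reference (or had a `PriceUpdate*` to the local been handed over), the thread could read or write the frame after post_update returned, which is exactly the kind of access that lands garbage on someone else's stack.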

Roddy · Answer 4 · 2008-10-31T17:16:21+0000

There is some confusion here between stack overflow and stack corruption.

Stack overflow is a very specific problem caused by trying to use more stack than the operating system has allocated to your thread. The three usual causes are as follows.

 void foo()
 {
     foo(); // endless recursion - whoops!
 }

 void foo2()
 {
     char myBuffer[A_VERY_BIG_NUMBER]; // The stack can't hold that much.
 }

 class bigObj
 {
     char myBuffer[A_VERY_BIG_NUMBER];
 };

 void foo2(bigObj big1) // pass by value of a big object - whoops!
 {
 }

On embedded systems, thread stack size may be measured in bytes, and even a simple calling sequence can cause problems. On Windows, each thread gets 1 megabyte of stack by default, so stack overflow is much less of a common problem. Unless you have endless recursion, stack overflows can always be mitigated by increasing the stack size, although this is usually NOT the best answer.
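As a sketch of how the second and third causes are commonly avoided (not necessarily the right fix for every code base): put the large buffer on the heap and pass large objects by const reference. The value chosen for A_VERY_BIG_NUMBER and the function names fill_big_buffer and inspect are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed size for illustration: 16 MiB, more than a typical thread stack.
const std::size_t A_VERY_BIG_NUMBER = 16u * 1024u * 1024u;

// The vector's storage lives on the heap; only the small vector object
// itself sits in this function's stack frame.
std::size_t fill_big_buffer() {
    std::vector<char> myBuffer(A_VERY_BIG_NUMBER, 'x');
    return myBuffer.size();
}

class bigObj {
public:
    bigObj() : myBuffer(A_VERY_BIG_NUMBER) {}
    std::size_t size() const { return myBuffer.size(); }
private:
    std::vector<char> myBuffer;  // heap-backed instead of an in-object array
};

// Passing by const reference avoids copying the object for the call.
std::size_t inspect(const bigObj& big1) {
    return big1.size();
}
```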

Stack corruption simply means writing outside the bounds of the current stack frame, thus potentially corrupting other data, or return addresses, on the stack.

At its simplest:

 void foo()
 {
     char message[10];
     message[10] = '!'; // whoops! beyond end of array
 }

Jonathan Leffler · Answer 5 · 2008-10-30T22:58:51+0000

This sounds like the problem: something is writing outside the bounds of an array and trampling over the stack frame (and probably the return address too) on the stack. There is a large literature on the subject. "The Shellcoder's Handbook" (2nd Edition) has SPARC examples that may help you.

postfuturist · Answer 6 · 2008-10-30T23:38:31+0000

Uninitialized variables and race conditions are the likely suspects for intermittent failures.

Zan Lynx · Answer 7 · 2008-10-31T00:03:56+0000

Is it possible to run the thing through Valgrind? Perhaps Sun provides a similar tool. Intel VTune (actually I was thinking of Thread Checker) also has some very nice tools for thread debugging and the like.

If your employer can spring for the cost of the more expensive tools, they can really make these sorts of problems a lot easier to solve.

Richard Harrison · Answer 8 · 2008-10-31T01:30:50+0000

It's not hard to mangle the frame pointer. If you look at the disassembly of a routine, you will see that it is pushed at the start of the routine and popped at the end, so if anything overwrites the stack it can get lost. The stack pointer is where the stack currently is, and the frame pointer is where it started (for the current routine).

First, I would verify that all of the libraries and related objects have been rebuilt clean and that all of the compiler options are consistent. I had a similar problem before (Solaris 2.5) that was caused by an object file that hadn't been rebuilt.

It sounds exactly like an overwrite, and putting guard blocks around memory isn't going to help if it is simply a bad offset.

After each core dump, examine the core file to learn as much as you can about the similarities between the faults. Then try to identify what is getting overwritten. As I remember, the frame pointer is the last stack pointer, so anything logically before the frame pointer shouldn't be modified in the current stack frame. So maybe record it, copy it elsewhere, and compare on return.

John · Answer 9 · 2008-10-30T22:59:39+0000

Is something meaning to assign the value 2 to a variable, but instead assigning 2 to its address?

The other details are lost on me, but "2" is the recurring theme in your problem description. ;)

Franci Penov · Answer 10 · 2008-10-30T23:06:41+0000

I would say this definitely sounds like stack corruption due to an out-of-bounds array or buffer write. A stack protector would help only as long as the writing is sequential, not random.

Steve Fallows · Answer 11 · 2008-10-30T23:21:22+0000

I second the notion that this is likely stack corruption. I'll add that the switch to a multi-threaded library makes me suspect that a lurking bug has been exposed. Possibly the buffer overflow used to land on unused memory, and now it is hitting another thread's stack. There are many other possible scenarios.

Sorry if that doesn't give much of a hint as to how to find it.

Peter Wlodarczyk · Answer 12 · 2008-10-31T00:49:40+0000

I tried Valgrind on it, but unfortunately it doesn't detect stack errors:

"In addition to the performance penalty, an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack-allocated data."

I tend to agree that this is stack corruption. The tricky thing is tracking it down. As I said, there are more than 100,000 lines of code in this thing (including custom libraries developed in-house, some of them going back as far as 1992), so if anyone has good tricks for catching this sort of thing, I'd be grateful. Arrays are being worked on all over the place, and the app uses OI for its GUI (if you haven't heard of OI, be grateful), so just looking for a logic error is a mammoth task and my time is short.

Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think the smaller stack that results from having multiple threads is what makes this crop up now, but that is pure supposition on my part.

No one asked this, but I'm building with gcc-4.2. Also, I can guarantee ABI safety here, so that's not the issue either. As for the "garbage at the end of the stack" on the RAM hit, the fact that it is universally 2 (though in different places in the code) makes me doubt that, as garbage tends to be random.

Msn · Answer 13 · 2008-10-31T20:52:57+0000

"Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think the smaller stack that results from having multiple threads is what makes this crop up now, but that is pure supposition on my part."

If you pass anything on the stack by reference or by address, this would most certainly happen if another thread tried to use it after the first thread had returned from the function.

You might be able to reproduce this by forcing the app onto a single processor. I don't know how you do that on SPARC.
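For the Linux build mentioned in the question, one way to try the single-processor experiment is to restrict the process to one CPU before any threads are created. This is a Linux-only sketch using sched_getaffinity and sched_setaffinity from glibc; the function name pin_to_single_cpu is made up, and nothing here applies to Solaris as written. From the shell, taskset (util-linux) can do the same without code changes.

```cpp
#include <sched.h>   // Linux/glibc specific; g++ defines _GNU_SOURCE by default
#include <cassert>

// Restricts the calling thread to the first CPU it is currently allowed
// to run on. Threads created afterwards inherit the mask, so call this
// early in main(). Returns the chosen CPU index, or -1 on failure.
int pin_to_single_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;  // no usable CPU found in the mask
}
```

If the crash rate changes noticeably once everything runs on one CPU, that points toward a race between threads rather than a purely single-threaded overwrite.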

lothar · Answer 14 · 2009-04-13T01:17:24+0000

Impossible to know, but here are some hints I can come up with.

With pthreads, each thread gets a fixed-size stack, either the implementation default or one you allocate and pass in through the thread attributes. Is it big enough? There is no automatic stack growth as in a single-threaded process.
If you are sure that you don't corrupt the stack by writing past stack-allocated data, check for stray pointers (mostly uninitialized pointers).
One of the threads could be overwriting some data that others depend on (check your data synchronization).
Debugging is usually not very helpful here. I would try to create lots of log output (traces for entry and exit of every function/method call) and then analyze the log.
The fact that the error manifests itself differently on Linux may help. What thread mapping are you using on Solaris? Make sure you map every thread to its own LWP to ease debugging.
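For the stack-size hint, a sketch of requesting an explicit stack size through the pthread attributes. worker and run_with_stack_size are hypothetical names; pthread_attr_init, pthread_attr_setstacksize, pthread_create and pthread_join are standard POSIX calls. Whether a larger stack actually changes the behaviour here is an open question, but it is a cheap experiment.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>

// Hypothetical worker: stores a value through its argument.
static void* worker(void* arg) {
    *static_cast<int*>(arg) = 42;
    return nullptr;
}

// Runs `worker` on a thread created with an explicit stack size.
// Returns 0 on success, otherwise the pthread error code.
int run_with_stack_size(std::size_t stack_bytes, int* out) {
    assert(out != nullptr);
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    // Fails with EINVAL if stack_bytes is below PTHREAD_STACK_MIN.
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        pthread_t tid;
        rc = pthread_create(&tid, &attr, worker, out);
        if (rc == 0) {
            rc = pthread_join(tid, nullptr);
        }
    }
    pthread_attr_destroy(&attr);
    return rc;
}
```

For example, run_with_stack_size(1024 * 1024, &result) asks for a 1 MiB stack; the figure is arbitrary and would need tuning for the real application.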
