Locating a segmentation fault in a multithreaded program running on a cluster

Using gdb to find a segmentation fault is very simple when you start a small program interactively. But suppose we have a multithreaded program written with pthreads and submitted to a cluster node (using qsub). In that case there is no interactive session.

How can we find the segmentation fault? I am looking for a general approach, program, or testing tool. I cannot provide a reproducible example, because the program is really big and it crashes on the cluster in some unknown situations.

I need to track down the problem in this difficult setting, because the program works correctly on a local machine with any number of threads.

+7
1 answer

The "normal" approach is to have the environment create a core file and then examine it. If this is not an option, you can try installing a signal handler for SIGSEGV that at least dumps a stack trace somewhere. Of course, this immediately leads to the question "how to get a stack trace", but that can be answered elsewhere.
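
As a rough illustration of the signal-handler idea, here is a minimal sketch in C. It assumes glibc (backtrace() and backtrace_symbols_fd() are glibc extensions, not POSIX) and assumes that the batch system captures the job's stderr in a file. The names install_segv_backtrace_handler and segv_backtrace_handler are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd(): glibc extensions */
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Hypothetical handler: runs in the thread that faulted and dumps its stack. */
static void segv_backtrace_handler(int sig)
{
    static const char msg[] = "SIGSEGV caught, backtrace of the faulting thread:\n";
    void *frames[MAX_FRAMES];
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;

    /* backtrace_symbols_fd() writes straight to the descriptor without malloc(). */
    int depth = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);

    /* Restore the default action and re-raise, so the process still dies
       with SIGSEGV and can still produce a core file if limits allow it. */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Hypothetical helper: call once near the start of main(). Returns 0 on success. */
int install_segv_backtrace_handler(void)
{
    void *warmup[1];
    struct sigaction sa;

    /* The first backtrace() call may load libgcc; do it outside the handler. */
    backtrace(warmup, 1);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_backtrace_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Signal dispositions are shared by all threads of a process, so installing the handler once is enough. With gcc, building with -g and -rdynamic usually makes the printed frames show function names; otherwise the raw addresses can be resolved afterwards with addr2line.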

The easiest way is probably to get a core file. Assuming you have a similar machine on which the core file can be read, you can use gdb program corefile to debug the program program that produced the core file corefile: you should be able to look at the different threads, their data (to some extent), etc. If you do not have a suitable machine, you may need to cross-compile gdb to match the hardware of the machine on which the program was running.

I am a little confused by the statement that the core files are empty: you can set the limits for core files using ulimit in the shell. If the core size limit is zero, no core file should be produced at all. Creating an empty one seems strange. However, if you cannot change the limits for your program, you are probably down to installing a signal handler and dumping a stack trace from the offending thread.
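
If the job script cannot call ulimit, the program itself may be able to raise its core-file limit. The following is a sketch only: enable_core_dumps is a hypothetical helper name, and it cannot help when the hard limit imposed by the cluster is already zero, because an unprivileged process cannot raise a hard limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>   /* getrlimit(), setrlimit(), RLIMIT_CORE */

/* Hypothetical helper: raise the soft core-file size limit to the hard
   limit. Returns 0 on success, -1 on error (errno is set). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    /* The soft limit may be raised up to, but not beyond, the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling this at the start of main() has roughly the effect of raising the soft limit with ulimit -c in the submitting shell. Whether and where the core file is then written still depends on the node's configuration.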

Thinking about it, you may be able to put the program to sleep in the signal handler and attach to it with a debugger, assuming that you can run a debugger on the corresponding machine. You would determine the process ID (using, for example, ps -elf | grep program) and then attach to it using

 gdb program pid 

I'm not sure how to put a program to sleep from within the program, though (perhaps by having the SIGSEGV handler stop the process, for example with SIGSTOP ...).
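
One possible way to do this, offered as a sketch rather than a tested recipe for any particular cluster: the SIGSEGV handler prints the host name and process ID and then sleeps in a loop until a flag is changed from the debugger. The names install_segv_wait_handler, segv_wait_handler and debugger_attached are hypothetical. SIGSTOP itself cannot be caught, but calling raise(SIGSTOP) inside the handler would be an alternative to the sleep loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set this to 1 from gdb ("set var debugger_attached = 1") to leave the loop. */
static volatile sig_atomic_t debugger_attached = 0;

/* The message is formatted at install time, so the handler only needs write(). */
static char wait_msg[160];
static size_t wait_msg_len = 0;

/* Hypothetical handler: announce where we are, then sleep until released. */
static void segv_wait_handler(int sig)
{
    ssize_t ignored = write(STDERR_FILENO, wait_msg, wait_msg_len);
    (void)ignored;

    while (!debugger_attached)
        sleep(1);

    /* Once released, fall back to the default action so that a repeated
       fault terminates the process instead of re-entering this handler. */
    signal(sig, SIG_DFL);
}

/* Hypothetical helper: call once near the start of main(). Returns 0 on success. */
int install_segv_wait_handler(void)
{
    char host[64] = {0};
    struct sigaction sa;
    int len;

    if (gethostname(host, sizeof host - 1) != 0)
        strcpy(host, "unknown");

    len = snprintf(wait_msg, sizeof wait_msg,
                   "SIGSEGV on host %s, pid %ld: waiting for a debugger\n",
                   host, (long)getpid());
    if (len < 0)
        len = 0;
    wait_msg_len = (size_t)len < sizeof wait_msg ? (size_t)len : sizeof wait_msg - 1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_wait_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

After attaching with gdb program pid on the node named in the message, the faulting thread should be the one sitting inside segv_wait_handler, and the frames above it in its backtrace lead to the faulting code.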

That said, I assume you have tried running your program on your local machine ...? Some problems are more fundamental than requiring a distributed system with many threads running on each node. This is, obviously, not a replacement for the approach above.

+4