Suppose you cannot fix your code ... and let's set aside all the crazy options, such as attaching gdb from a script.
You can either check the CPU usage (most of the accidental infinite loops I have written kept the CPU at 100% for an hour :)), or (the more likely option) use strace to see what the program is doing right now and implement your own trace-signature check (for example, if the same 20 system calls repeat 20 times in a row, assume an infinite loop).
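A minimal sketch of the CPU-usage check, assuming Linux and its `/proc/<pid>/stat` format. It samples the process's user + system CPU time twice and divides by the elapsed wall time. The function names and the threshold in the usage line are hypothetical, chosen for illustration.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (comm) is in parentheses and may itself contain spaces or ')',
    # so only split what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field k is rest[k - 3]:
    # utime is field 14, stime is field 15.
    return int(rest[11]) + int(rest[12])


def cpu_fraction(pid, interval=5.0):
    """Fraction of one CPU that `pid` used over `interval` seconds (Linux only)."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second

    def sample():
        with open(f"/proc/{pid}/stat") as f:
            return cpu_ticks(f.read())

    before = sample()
    time.sleep(interval)
    after = sample()
    return (after - before) / hz / interval
```

Usage might look like `if cpu_fraction(pid, 60) > 0.95: ...` (the 0.95 threshold is an arbitrary assumption). A multithreaded process can report more than 1.0, and high CPU alone only hints at a loop; it does not prove one.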
For instance:
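A rough sketch of the strace-signature idea, assuming the trace was captured with something like `strace -f -p <pid> -o trace.txt` and read back as lines. It pulls the syscall name from each line and checks whether the most recent calls consist of one short block repeated many times. The 20/20 defaults mirror the numbers above; the function names are hypothetical.

```python
import re

# Syscall name at the start of an strace line, optionally preceded by a PID
# ("[pid 123] " when tracing threads to stderr, "123 " with -f -o FILE).
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscall_names(lines):
    """Extract syscall names; signal (---), exit (+++) and '<... resumed>' lines are skipped."""
    names = []
    for line in lines:
        m = SYSCALL_RE.match(line)
        if m:
            names.append(m.group(1))
    return names


def repeating_tail(calls, max_period=20, min_repeats=20):
    """Return the block of calls that the tail of `calls` repeats at least
    `min_repeats` times in a row (block length <= max_period), else None."""
    n = len(calls)
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if needed > n:
            break  # longer periods need even more calls
        tail = calls[n - needed:]
        block = tail[:period]
        # The tail is periodic if every call matches the block at its offset.
        if all(tail[i] == block[i % period] for i in range(needed)):
            return block
    return None
```

Checking only the tail keeps it cheap to re-run as new trace lines arrive. A legitimate event loop (e.g. repeated `poll`) will also match, so the thresholds need tuning for the program being watched.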
As for the system detecting this automatically ... It is normal for a program to crash after calling things uncontrollably in a loop (for example, calling malloc() until it runs out of memory, opening files ...), but I have never seen the system (kernel) restart the application (correct me in the comments if I am wrong). I think you have one of these:
- code inside the program itself (signal handling, whatever) that helps it recover
- a watchdog that checks every 20 seconds whether <pid> is running and, if it is not, starts a new instance
- a distribution whose service configuration restarts the program when it stops (for example, systemd's Restart= setting)
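A minimal sketch of the watchdog option, assuming the watchdog itself launches the program. Instead of polling a bare <pid>, it keeps the child as a `subprocess.Popen` object and uses `poll()` every 20 seconds, which avoids mistaking a recycled PID for the original process. The `watchdog` function and its `max_restarts` parameter are hypothetical.

```python
import subprocess
import sys
import time


def watchdog(cmd, interval=20.0, max_restarts=None):
    """Start `cmd`, check every `interval` seconds whether it is still running,
    and start a new instance when it has exited. Returns the restart count once
    `max_restarts` is reached (runs forever if max_restarts is None)."""
    restarts = 0
    proc = subprocess.Popen(cmd)
    while True:
        time.sleep(interval)
        if proc.poll() is None:
            continue  # still running: nothing to do
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        print(f"exited with {proc.returncode}, restarting", file=sys.stderr)
        proc = subprocess.Popen(cmd)
```

For example, `watchdog(["/usr/local/bin/myapp"])` with a hypothetical path. Note that this only notices an exit: a process stuck in an infinite loop still counts as running, so it would have to be combined with a check like the CPU or strace ones to catch a hang.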
But I really doubt that Linux itself will be that kind to your application.