We have an embedded PowerPC-based system running Linux. We are seeing random SIGILL crashes across a wide variety of applications. In each case the instruction about to be executed has been zeroed out, which indicates corruption of the text segment in memory. Since the text segment is mapped read-only, the application itself cannot corrupt it, so I suspect that some common subsystem (DMA?) is causing the corruption.

Because the problem takes several days to reproduce (a crash due to SIGILL), it is difficult to investigate. As a first step, I want to find out whether and when the text segment of any application gets corrupted. I have looked at the stack trace, and all pointers and registers are correct.
Do you have any suggestions on how I can do this?
Some information:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux
(gdb) bt
#0 0x10457dc0 in xxx
Disassembly output:
=> 0x10457dc0 <+80>: mr r1, r11
0x10457dc4 <+84>: blr
Instruction expected at 0x10457dc0: 0x7d615b78
Instruction found at 0x10457dc0 after catching SIGILL: 0x00000000
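To make the comparison above concrete, here is a minimal sketch that splits a 32-bit PowerPC instruction word into its X-form fields. The helper names `decode_x_form` and `is_mr` are hypothetical and only for illustration. It shows that 0x7d615b78 decodes to `or r1, r11, r11` (which gdb prints as `mr r1, r11`), while an all-zero word has primary opcode 0, which is not a valid instruction on PowerPC and so raises SIGILL.

```python
def decode_x_form(word):
    """Split a 32-bit PowerPC X-form instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode (top 6 bits)
        "rS": (word >> 21) & 0x1F,      # source register
        "rA": (word >> 16) & 0x1F,      # destination register for 'or'
        "rB": (word >> 11) & 0x1F,      # second source register
        "xo": (word >> 1) & 0x3FF,      # extended opcode
        "rc": word & 0x1,               # record bit
    }


def is_mr(word):
    """True if the word encodes 'mr rA, rS', i.e. 'or rA, rS, rS'."""
    f = decode_x_form(word)
    return (f["opcode"] == 31 and f["xo"] == 444
            and f["rS"] == f["rB"] and f["rc"] == 0)


if __name__ == "__main__":
    for word in (0x7D615B78, 0x00000000):
        print(hex(word), decode_x_form(word), "mr" if is_mr(word) else "not mr")
```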
(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY HAS_CONTENTS CODE
Expected (from the application binary):
(gdb) x/32 0x10457da0
0x10457da0: 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0: 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0: 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0: 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0
Actual (after handling SIGILL and dumping the nearby memory locations):
Error Address: 0x10457dc0
0x10457da0: 0x913E0000
0x10457db0: 0x83ABFFF4
=> 0x10457dc0: 0x00000000
0x10457dd0: 0x4BCD8BE5
0x10457de0: 0x93E1000C
Edit:
One lead we have is that the corruption always happens at an address whose page offset is 0xdc0.
For example, Error address: 0x10653dc0 << printed by our application after catching SIGILL
Error address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
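The pattern in the addresses above can be checked mechanically. The following is a small sketch, assuming 4 KiB pages, that masks each faulting address down to its offset within the page. The list `FAULT_ADDRESSES` is copied by hand from the messages in this post, and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Faulting addresses copied from the messages in this post.
FAULT_ADDRESSES = [
    "10457dc0", "10653dc0", "1000ddc0",  # printed by our SIGILL handler
    "0fed6dc0", "0fed6dc0", "0fe09dc0",  # from the kernel log
    "0fe09dc0", "0fecfdc0", "0fe74dc0",
]


def page_offset(addr):
    """Return the offset of a virtual address within its page."""
    return addr & (PAGE_SIZE - 1)


def offsets_seen(hex_addresses):
    """Return the set of distinct page offsets among the given hex strings."""
    return {page_offset(int(a, 16)) for a in hex_addresses}


if __name__ == "__main__":
    print(sorted(hex(o) for o in offsets_seen(FAULT_ADDRESSES)))
```

Because the page offset is the same in the virtual and the physical address, a single repeated offset across unrelated processes is consistent with one fixed location inside one physical page being overwritten.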
Edit 2: Another finding is that the corruption always occurs when the physical frame number is 0x00a4d. I believe that with a PAGE_SIZE of 4096 this translates to the physical address 0x00A4DDC0. We suspect a couple of our kernel drivers and are investigating further. Is there a better approach (such as setting a hardware watchpoint) that might be more effective? How about KASAN, as suggested below?
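The address arithmetic above can be sketched as follows. The `/proc/<pid>/pagemap` lookup is an assumption about how one could obtain the frame number from user space (it is not necessarily how we obtained it): each 64-bit entry holds the frame number in bits 0-54 and a "present" flag in bit 63, and reading it typically requires root. The function names are hypothetical.

```python
import struct

PAGE_SHIFT = 12                 # PAGE_SIZE of 4096, as stated above
PAGE_SIZE = 1 << PAGE_SHIFT
PFN_MASK = (1 << 55) - 1        # pagemap entry bits 0-54: frame number
PRESENT_BIT = 1 << 63           # pagemap entry bit 63: page present in RAM


def phys_addr(pfn, vaddr):
    """Combine a physical frame number with the page offset of vaddr."""
    return (pfn << PAGE_SHIFT) | (vaddr & (PAGE_SIZE - 1))


def decode_pagemap_entry(entry):
    """Return the frame number from a pagemap entry, or None if not present."""
    if not entry & PRESENT_BIT:
        return None
    return entry & PFN_MASK


def lookup_pfn(pid, vaddr):
    """Read the pagemap entry for vaddr in process pid (assumed approach)."""
    with open("/proc/{}/pagemap".format(pid), "rb") as f:
        f.seek((vaddr >> PAGE_SHIFT) * 8)   # one 8-byte entry per virtual page
        data = f.read(8)
    if len(data) != 8:
        return None
    (entry,) = struct.unpack("=Q", data)    # native byte order
    return decode_pagemap_entry(entry)


if __name__ == "__main__":
    print(hex(phys_addr(0x00A4D, 0x10457DC0)))
```

With the frame number 0x00a4d and the faulting address 0x10457dc0, `phys_addr` yields 0xa4ddc0, which matches the physical address given above.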
Any help is appreciated. Thanks.
c++ c linux linux-kernel memory-corruption
Nikhil Utane