OK, here is the setup: I work in HPC, and we are getting ready to scale to tens of thousands of nodes. To handle this, I implemented a local process that caches information on each node to reduce the amount of network traffic. It then exposes this information through shared memory. The basic logic is that there is one well-known shared memory block that contains the names of the currently cached tables. When an update occurs, the caching tool creates a new shared memory table, populates it, and then updates the well-known block with the name of the new table.
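To make the scheme concrete, here is a minimal sketch of the create, populate, publish cycle described above. It is not the real implementation: `directory_t`, `create_segment`, `publish_table`, and the `/example_table_N` naming are hypothetical, and retiring the previous name with shm_unlink() is an assumption about how old tables would be cleaned up. Synchronization with readers is left out entirely.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NAME_LEN 64

/* Hypothetical layout of the well-known block: the current table name. */
typedef struct {
    char current_table[NAME_LEN];
} directory_t;

/* Create a shared memory object of `size` bytes and map it read/write. */
static void *create_segment(const char *name, size_t size)
{
    int fd = shm_open(name, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return MAP_FAILED;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p;
}

/* Build a new table under a fresh name, then point the directory at it. */
int publish_table(directory_t *dir, unsigned generation,
                  const void *data, size_t size)
{
    char name[NAME_LEN] = {0};
    char old[NAME_LEN];

    snprintf(name, sizeof name, "/example_table_%u", generation);

    void *table = create_segment(name, size);
    if (table == MAP_FAILED)
        return -1;
    memcpy(table, data, size); /* populate before publishing */
    munmap(table, size);

    memcpy(old, dir->current_table, NAME_LEN);
    memcpy(dir->current_table, name, NAME_LEN); /* publish the new name */

    /* Assumption: the previous table is retired once it is replaced. */
    if (old[0] != '\0')
        shm_unlink(old);
    return 0;
}
```

In this sketch the directory is passed in as a pointer, so it could itself live in a mapping created by `create_segment` under a fixed name.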
The code seems to work fine (for example, valgrind reports no memory leaks), but when I intentionally stress test it, the first 783 updates work fine, and on the 784th I get a SIGBUS when I try to write to the mapped memory.
If the problem were too many open files (because I was leaking file descriptors), I would expect shm_open() to fail. If the problem were that I was leaking memory, I would expect mmap() to fail or valgrind to report leaks.
Here's a snippet of the code. Can anyone offer a suggestion?
int initialize_paths(writer_t *w, unsigned max_paths)
{
    int err = 0;
    reader_t *r = &(w->unpublished);

    close_table(r, PATH_TABLE);

    w->max_paths = max_paths;

    err = open_table(r, PATH_TABLE, O_RDWR | O_CREAT, max_paths, 0);

    return err;
}

static void close_table(reader_t *r, int table)
{
    if (r->path_table && r->path_table != MAP_FAILED) {
        munmap(r->path_table, r->path_table->size);
        r->path_table = NULL;
    }
    if (r->path_fd > 0) {
        close(r->path_fd);
        r->path_fd = 0;
    }
}

static int open_table(op_ppath_reader_t *r, int table, int rw, unsigned c, unsigned c2)
{
UPDATE:
Here is the debug output from the failing pass:
NOTICE: Pass 783: Inserting records.
NOTICE: Creating the path table.
TRC: initialize_paths[
TRC: close_table[
TRC: close_table]
TRC: open_table[
DBG: h=0x0x2a956b2000, size1=2621536, size2=0
Here is the same output from the previous iteration:
NOTICE: Pass 782: Inserting records.
NOTICE: Creating the path table.
TRC: initialize_paths[
TRC: close_table[
TRC: close_table]
TRC: open_ppath_table[
DBG: h=0x0x2a956b2000, size1=2621536, size2=0
TRC: open_ppath_table]
TRC: op_ppath_initialize_paths]
Note that the pointer address looks valid, as does the size.
GDB reports the crash like this:
Program received signal SIGBUS, Bus error.
[Switching to Thread 182895447776 (LWP 5328)]
0x00000034a9371d20 in memset () from /lib64/tls/libc.so.6
(gdb) where
#0  0x00000034a9371d20 in memset () from /lib64/tls/libc.so.6
#1  0x0000002a955949d0 in open_table (r=0x7fbffff188, table=1, rw=66, c=32768, c2=0) at ofedplus_path_private.c:294
#2  0x0000002a95595280 in initialize_paths (w=0x7fbffff130, max_paths=32768) at path_private.c:567
#3  0x0000000000402050 in server (fname=0x7fbffff270 "gidtable", n=10000) at opp_cache_test.c:202
#4  0x0000000000403086 in main (argc=6, argv=0x7fbffff568) at opp_cache_test.c:542
(gdb)
Removing the memset still results in a SIGBUS when h->size1 is set on the next line; size1 is the first 4 bytes of the mapped region.
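Since the fault happens on the first write rather than in shm_open() or mmap(), one thing worth testing is whether the backing storage for the mapping can actually be allocated: ftruncate() and mmap() do not reserve pages up front, so a shortage may only surface when a page is first touched. The following is a diagnostic sketch, not the real open_table(): the helper `open_checked` and its parameters are hypothetical. It reports the step that fails and uses posix_fallocate() to request the storage before mapping, so a shortage would show up as an error code instead of a signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create, size, and map a segment, reporting
 * which step fails instead of faulting on the first write. */
void *open_checked(const char *name, size_t size)
{
    int fd = shm_open(name, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        fprintf(stderr, "shm_open(%s): %s\n", name, strerror(errno));
        return NULL;
    }
    if (ftruncate(fd, (off_t)size) != 0) {
        fprintf(stderr, "ftruncate: %s\n", strerror(errno));
        close(fd);
        return NULL;
    }
    /* Ask for the backing storage now. posix_fallocate() returns the
     * error number directly; it does not set errno. */
    int rc = posix_fallocate(fd, 0, (off_t)size);
    if (rc != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        fprintf(stderr, "mmap: %s\n", strerror(errno));
        close(fd);
        return NULL;
    }
    close(fd); /* the mapping remains valid without the descriptor */
    memset(p, 0, size); /* storage was requested above before touching it */
    return p;
}
```

If posix_fallocate() starts failing around the same pass where the SIGBUS appears, that would point at exhausted backing storage rather than at descriptors or address space.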