Problem: Segmentation fault (SIGSEGV, signal 11)
Short description of the program:
- High-performance GPU server (CUDA), processing requests from remote clients
- Each incoming request spawns a thread that performs calculations on several GPUs (serially, not in parallel) and sends a result back to the client; this usually takes 10 to 200 ms, since each request consists of tens or hundreds of kernel calls.
- Request handler threads have exclusive access to the GPUs, meaning that if one thread is running something on GPU1, all other threads have to wait until it finishes (a rough sketch of this handler pattern follows this list).
- compiled with -arch=sm_35 -code=compute_35
- using CUDA 5.0
- I do not use CUDA atomics explicitly or any in-kernel synchronization barriers, though I do use Thrust (various functions) and cudaDeviceSynchronize(), obviously
- Nvidia driver: NVIDIA dlloader X driver 313.30 Wed Mar 27 15:33:21 PDT 2013
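To make the setup above concrete, here is a minimal sketch of the handler pattern under my own assumptions: the per-GPU std::mutex locking, the gpu_stage() helper and the thrust::reduce call are illustrative stand-ins, not the actual server code.

    // Minimal sketch of the request handler pattern described above (illustrative
    // only; the locking scheme, helper names and the thrust::reduce call are
    // assumptions, not the real server code).
    #include <mutex>
    #include <vector>
    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>

    static std::mutex g_gpu_lock[4];                 // one lock per GPU => exclusive access

    // One stage of a request: runs on a single GPU while holding that GPU's lock.
    double gpu_stage(int device, const std::vector<float>& input)
    {
        std::lock_guard<std::mutex> guard(g_gpu_lock[device]);
        cudaSetDevice(device);                       // bind this thread to the GPU
        thrust::device_vector<float> d(input);       // copy the input to the device
        double sum = thrust::reduce(d.begin(), d.end(), 0.0);
        cudaDeviceSynchronize();                     // wait until this GPU is idle again
        return sum;
    }

    // One thread per incoming request; tens/hundreds of kernel calls, 10-200 ms total.
    double handle_request(const std::vector<float>& payload)
    {
        double result = 0.0;
        for (int dev = 0; dev < 3; ++dev)            // several GPUs, serially (not in parallel)
            result += gpu_stage(dev, payload);
        return result;                               // sent back to the client
    }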
OS and HW Information:
- Linux lub1 3.5.0-23-generic #35~precise1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
- GPUs: 4x GeForce GTX TITAN
- 32 GB RAM
- MB: ASUS MAXIMUS V EXTREME
- CPU: i7-3770K
Failure Information:
The failure occurs "randomly" after several thousand requests have been processed (sometimes sooner, sometimes later). Stack traces from some of the crashes look like this:
#0 0x00007f8a5b18fd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
As you can see, it usually ends up in __pthread_getspecific, called from libcuda.so or somewhere within the library itself. As far as I remember, there was only one case where it did not crash but instead hung in a strange way: the program was able to answer my requests as long as they did not involve any GPU computation (statistics etc.), but for requests that did, I never got a response. Also, running nvidia-smi -L did not work; it just hung until I rebooted the machine. That looked to me like a dead/hung GPU, and it may be a completely different issue than this one.
Does anyone have an idea where the problem might lie or what could cause it?
Update:
Some additional analysis:
cuda-memcheck does not report any errors. valgrind with leak checking prints quite a lot of messages, like the ones below (there are hundreds of similar ones):
==2464== 16 bytes in 1 blocks are definitely lost in loss record 6 of 725
==2464==    at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x56B859D: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5050C82: __nptl_deallocate_tsd (pthread_create.c:156)
==2464==    by 0x5050EA7: start_thread (pthread_create.c:315)
==2464==    by 0x6DDBCBC: clone (clone.S:112)
==2464==
==2464== 16 bytes in 1 blocks are definitely lost in loss record 7 of 725
==2464==    at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x56B86D8: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5677E0F: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x400F90D: _dl_fini (dl-fini.c:254)
==2464==    by 0x6D23900: __run_exit_handlers (exit.c:78)
==2464==    by 0x6D23984: exit (exit.c:100)
==2464==    by 0x6D09773: (below main) (libc-start.c:258)
==2464==
==2464== 408 bytes in 3 blocks are possibly lost in loss record 222 of 725
==2464==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x5A89B98: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A8A1F2: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A8A3FF: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5B02E34: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5AFFAA5: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5AAF009: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A7A6D3: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x59B205C: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5984544: cuInit (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x568983B: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5689967: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
Additional Information:
I tried running on fewer cards (3, which is the minimum the program needs), and it still crashed.
The above turned out to be incorrect: I had misconfigured the application and it was still using all four cards. After repeating the experiments with only three cards, the problem seems to go away; the server has now been running for several hours under heavy load without failures. I will let it run a while longer and then try a different subset of 3 cards, both to verify this and to check whether the problem is tied to one particular card.
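For the next experiment, one way to pin the process to a specific subset of three cards (assuming the application itself does not hard-code device ids; this is my assumption, not how the server is actually configured) would be the CUDA_VISIBLE_DEVICES environment variable:

    # Hypothetical invocations: expose only three of the four TITANs to the process,
    # then repeat with a different subset to see whether one particular card is involved.
    CUDA_VISIBLE_DEVICES=0,1,2 ./server
    CUDA_VISIBLE_DEVICES=0,1,3 ./server

With this, the application's cudaSetDevice(0..2) calls map onto whichever physical cards are exposed, so the same binary can be tested against each subset.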
I monitored the GPU temperatures during the test runs, and there does not seem to be anything wrong there. The cards reach about 78-80 °C at maximum load, with the fans at about 56%, and it stays like that until the crash occurs (after a few minutes); that does not seem too high to me.
One thing I have been thinking about is the way requests are processed: there are many calls to cudaSetDevice, since each request spawns a new thread (I use the mongoose library), and this thread then switches between cards by calling cudaSetDevice(id) with the appropriate device id. Switching can happen several times during a single request, and I do not use any CUDA streams (so everything goes into the default (0) stream, IIRC). Could this be somehow related to the crashes occurring in pthread_getspecific?
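Since the crash sits in __pthread_getspecific (thread-local storage) and the valgrind records point at __nptl_deallocate_tsd on thread exit, a small stand-alone stress test that mimics just this pattern (many short-lived threads, each switching devices on the default stream) might show whether thread churn alone can trigger it. This is a hypothetical test harness, not code from the server:

    // Hypothetical stand-alone stress test (not part of the server) that mimics the
    // request pattern: many short-lived threads, each switching between devices via
    // cudaSetDevice() and doing a trivial amount of work on the default stream.
    // If the crash is tied to per-thread CUDA runtime state being created and torn
    // down on every request thread, this loop may reproduce it.
    #include <cstdio>
    #include <thread>
    #include <cuda_runtime.h>

    void fake_request(int id, int num_devices)
    {
        for (int d = 0; d < num_devices; ++d) {
            cudaSetDevice((id + d) % num_devices);   // switch cards, like a real request
            void* p = nullptr;
            cudaMalloc(&p, 1 << 20);                 // touch the device a little
            cudaFree(p);
            cudaDeviceSynchronize();                 // default (0) stream, as in the server
        }
    }

    int main()
    {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        if (num_devices == 0) return 1;
        for (int i = 0; ; ++i) {                     // run until it crashes or is killed
            std::thread t(fake_request, i, num_devices);
            t.join();                                // one short-lived thread at a time
            if (i % 1000 == 0) std::printf("%d fake requests done\n", i);
        }
    }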
I also tried updating to the latest drivers (beta, 319.12), but that didn't help.