Segmentation fault in __pthread_getspecific called from libcuda.so.1

Problem: Segmentation fault (SIGSEGV, signal 11)

Short description of the program:

  • A high-performance GPU server (CUDA) processing requests from a remote client server.
  • Each incoming request spawns a thread that performs calculations on several GPUs (serially, not in parallel) and sends a result back to the client; this usually takes 10 to 200 ms, since each request consists of tens or hundreds of kernel calls.
  • Request-handler threads have exclusive access to the GPUs: if one thread starts something on GPU 1, everyone else has to wait until it finishes. A minimal sketch of this flow follows the list.
  • Compiled with -arch=sm_35 -code=compute_35.
  • Using CUDA 5.0.
  • I do not use CUDA atomics explicitly or any in-kernel synchronization barriers, although I do use Thrust (various functions) and cudaDeviceSynchronize(), obviously.
  • NVIDIA driver: NVIDIA dlloader X driver 313.30 Wed Mar 27 15:33:21 PDT 2013
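
To make that concrete, here is a minimal sketch of the pattern described above (gpu_mutex, handle_request, and the Thrust calls are placeholders I made up, not the actual code):

    // Hypothetical sketch: one mutex per GPU serializes access, and each
    // request thread works through the GPUs one after another (serially).
    #include <mutex>
    #include <vector>
    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    static std::mutex gpu_mutex[4];   // one lock per physical GPU (4 in this system)

    void handle_request(const std::vector<int>& input, int num_gpus)
    {
        for (int dev = 0; dev < num_gpus; ++dev) {             // serial, not parallel
            std::lock_guard<std::mutex> lock(gpu_mutex[dev]);  // exclusive GPU access
            cudaSetDevice(dev);                                // bind this thread to the GPU

            thrust::device_vector<int> d(input.begin(), input.end());
            thrust::sort(d.begin(), d.end());                  // stands in for the real kernels

            cudaDeviceSynchronize();                           // wait before moving on
        }
    }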

OS and HW Information:

  • Linux lub1 3.5.0-23-generic #35~precise1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
  • GPUs: 4x GeForce GTX TITAN
  • 32 GB RAM
  • MB: ASUS MAXIMUS V EXTREME
  • CPU: i7-3770K

Failure Information:

The failure occurs "randomly" after several thousand requests have been processed (sometimes sooner, sometimes later). Stack traces from some of the crashes look like this:

#0 0x00007f8a5b18fd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007f8a5a0c0cf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f8a59ff7b30 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f8a59fcc34a in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f8a5ab253e7 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5 0x00007f8a5ab484fa in cudaGetDevice () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6 0x000000000046c2a6 in thrust::detail::backend::cuda::arch::device_properties() ()

#0 0x00007ff03ba35d91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007ff03a966cf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007ff03aa24f8b in ?? () from /usr/lib/libcuda.so.1
#3 0x00007ff03b3e411c in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#4 0x00007ff03b3dd4b3 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#5 0x00007ff03b3d18e0 in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#6 0x00007ff03b3fc4d9 in cudaMemset () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7 0x0000000000448177 in libgbase::cudaGenericDatabase::cudaCountIndividual(unsigned int, ...

#0 0x00007f01db6d6153 in ?? () from /usr/lib/libcuda.so.1
#1 0x00007f01db6db7e4 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f01db6dbc30 in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f01db6dbec2 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f01db6c6c58 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f01db6c7b49 in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f01db6bdc22 in ?? () from /usr/lib/libcuda.so.1
#7 0x00007f01db5f0df7 in ?? () from /usr/lib/libcuda.so.1
#8 0x00007f01db5f4e0d in ?? () from /usr/lib/libcuda.so.1
#9 0x00007f01db5dbcea in ?? () from /usr/lib/libcuda.so.1
#10 0x00007f01dc11e0aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#11 0x00007f01dc1466dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#12 0x0000000000472373 in thrust::detail::backend::cuda::detail::b40c_thrust::BaseRadixSortingEnactor

#0 0x00007f397533dd91 in __pthread_getspecific (key=4) at pthread_getspecific.c:62
#1 0x00007f397426ecf3 in ?? () from /usr/lib/libcuda.so.1
#2 0x00007f397427baec in ?? () from /usr/lib/libcuda.so.1
#3 0x00007f39741a9840 in ?? () from /usr/lib/libcuda.so.1
#4 0x00007f39741add08 in ?? () from /usr/lib/libcuda.so.1
#5 0x00007f3974194cea in ?? () from /usr/lib/libcuda.so.1
#6 0x00007f3974cd70aa in ?? () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#7 0x00007f3974cff6dd in cudaMemcpy () from /usr/local/cuda-5.0/lib64/libcudart.so.5.0
#8 0x000000000046bf26 in thrust::detail::backend::cuda::detail::checked_cudaMemcpy(void*

As you can see, it usually ends up in __pthread_getspecific, called from libcuda.so.1 or from somewhere inside that library itself. As far as I remember, there was only one case where it did not crash but instead hung in a strange way: the program could still answer requests that did not involve any GPU computation (statistics, etc.), but requests that did never got a response. Also, running nvidia-smi -L did not work; it just hung until I rebooted the machine. That looked to me like a dead or hung GPU. It may be a completely different issue from this one, though.

Does anyone have an idea where the problem might lie or what could be causing it?

Update:

Some additional analysis:

  • cuda-memcheck does not display any error messages.
  • valgrind with --leak-check prints quite a lot of messages, such as the ones below (and there are hundreds of similar ones):
==2464== 16 bytes in 1 blocks are definitely lost in loss record 6 of 725
==2464==    at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x56B859D: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5050C82: __nptl_deallocate_tsd (pthread_create.c:156)
==2464==    by 0x5050EA7: start_thread (pthread_create.c:315)
==2464==    by 0x6DDBCBC: clone (clone.S:112)
==2464==
==2464== 16 bytes in 1 blocks are definitely lost in loss record 7 of 725
==2464==    at 0x4C2B1C7: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x568C202: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x56B86D8: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5677E0F: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x400F90D: _dl_fini (dl-fini.c:254)
==2464==    by 0x6D23900: __run_exit_handlers (exit.c:78)
==2464==    by 0x6D23984: exit (exit.c:100)
==2464==    by 0x6D09773: (below main) (libc-start.c:258)
==2464== 408 bytes in 3 blocks are possibly lost in loss record 222 of 725
==2464==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2464==    by 0x5A89B98: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A8A1F2: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A8A3FF: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5B02E34: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5AFFAA5: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5AAF009: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5A7A6D3: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x59B205C: ??? (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x5984544: cuInit (in /usr/lib/libcuda.so.313.30)
==2464==    by 0x568983B: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)
==2464==    by 0x5689967: ??? (in /usr/local/cuda-5.0/lib64/libcudart.so.5.0.35)

Additional Information:

I tried running on fewer cards (3, since that is the minimum the program needs), and it still crashed.

The above turned out to be wrong: I had misconfigured the application and it was still using all four cards. Re-running the experiment with really only three cards appears to solve the problem; it has now been running for several hours under heavy load without failures. I will let it run a bit longer and then perhaps try a different subset of 3 cards, to verify this and at the same time to check whether the problem is tied to one particular card.

I monitored the GPU temperatures during the test runs, and they do not seem to be the problem. The cards reach about 78-80 °C at maximum load, with the fans at around 56%, and this holds until the crash occurs (a few minutes in); that does not seem too hot to me.

One thing I have been thinking about is the way requests are handled: there are many calls to cudaSetDevice, since each request spawns a new thread (I use the mongoose library), and that thread then switches between the cards by calling cudaSetDevice(id) with the appropriate device ID. Switching can happen several times during a single request, and I do not use any streams (so everything goes to the default (0) stream, IIRC). A rough sketch of this flow is shown below. Could this be somehow related to the crashes occurring in pthread_getspecific?
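
For context, each request thread does roughly the following (a sketch only; process_request is a placeholder name and mongoose's actual callback plumbing is omitted). cudaSetDevice only changes the calling thread's current device, which the runtime keeps in thread-local storage (presumably the state that the __pthread_getspecific calls in the backtraces are reading):

    // Sketch of a request thread spawned per incoming connection.
    // The thread repeatedly switches the current device and issues work
    // on the default (0) stream, then synchronizes before moving on.
    #include <cuda_runtime.h>

    void process_request(/* request data */)
    {
        const int devices[] = {0, 1, 2, 3};   // cards touched by this request
        for (int i = 0; i < 4; ++i) {
            cudaSetDevice(devices[i]);        // per-thread "current device" switch
            // ... tens to hundreds of kernel launches / Thrust calls on stream 0 ...
            cudaDeviceSynchronize();
        }
    }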

I also tried updating to the latest drivers (beta, 319.12), but that didn't help.

1 answer

If you can identify 3 cards that work, try cycling the 4th card in place of one of the three and see whether you get the crashes again; this is just standard troubleshooting. If you can identify a single card that, whenever it is included in a group of 3, still causes the problem, then that card is the suspect.

But my suggestion to run with fewer cards was also based on the idea that doing so might reduce the overall load on the PSU. Even at 1500 W, you may not have enough juice. So if you cycle the 4th card in place of one of the 3 (i.e., you still keep only 3 cards in the system, or configure the application to use only 3) and you get no crashes, then the problem may be the total power draw with 4 cards.

Note that the power consumption of a GTX TITAN under full load can be on the order of 250 W, or possibly more. So while it might seem that your 1500 W power supply should be fine, you may need to analyze carefully how much 12 V DC power is available on each rail, and how the motherboard and PSU distribute the 12 V rails to each GPU.

So if reducing to 3 GPUs seems to fix the problem no matter which 3 you use, I would guess your power supply is not up to the task. Not all 1500 W are available from a single DC rail. The "12 V supply" actually consists of several separate 12 V rails, each of which delivers only a portion of the total 1500 W. So even if you are not pulling 1500 W overall, you can still overload a single rail, depending on how the GPU power connectors are wired to the rails.
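
To put rough, illustrative numbers on that (the actual per-rail limits depend on your specific PSU model, which I am assuming here rather than quoting):

    4 x GTX TITAN at ~250 W each                  ~ 1000 W
    i7-3770K (77 W TDP) + board, RAM, drives      ~  150 W
    estimated peak system draw                    ~ 1150 W

That total fits comfortably inside 1500 W, but each TITAN pulls up to 75 W through the PCIe slot and the rest through its 6-pin and 8-pin connectors, and on a multi-rail PSU an individual 12 V rail is often rated for only something like 20-40 A (roughly 240-480 W). Two cards whose power connectors land on the same rail can therefore exceed that rail's limit even though the overall wattage looks fine.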

I agree that temperatures in the 80 °C range should be OK, but that indicates (approximately) a fully loaded GPU, so if you are seeing that on all 4 GPUs at once, you are pulling a substantial load.
