Why does my Hello world program take almost 10 seconds?

I installed the CUDA runtime and drivers, version 7.0, on my workstation (Ubuntu 14.04, 2x Intel Xeon E5 + 4x Tesla K20m). I used the following program to check whether my installation works:

    #include <stdio.h>

    __global__ void helloFromGPU()
    {
        printf("Hello World from GPU!\n");
    }

    int main(int argc, char **argv)
    {
        printf("Hello World from CPU!\n");

        helloFromGPU<<<1, 1>>>();
        printf("Hello World from CPU! Again!\n");

        cudaDeviceSynchronize();
        printf("Hello World from CPU! Yet again!\n");
        return 0;
    }

I get the correct output, but it takes an enormous amount of time:

    $ nvcc hello.cu -O2
    $ time ./hello > /dev/null

    real    0m8.897s
    user    0m0.004s
    sys     0m1.017s

If I remove all the device code, the whole run takes 0.001 s. So why does my simple program take almost 10 seconds?

1 answer

The apparent slowness of your example is due to the underlying fixed cost of setting up the GPU context.

Since you are running on a platform that supports unified addressing, the CUDA runtime has to map 64 GB of host RAM and 4 x 5120 MB of GPU memory into a single virtual address space and register it with the Linux kernel.

Doing that takes a lot of kernel API calls, and it is not fast. I would guess that this is the main source of the slowness you are observing. You should treat it as a fixed startup cost that must be amortized over the life of your application. In real-world applications, a 10-second startup is trivial and of no real importance. In a hello world example, it is not.
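To see how much of the run time is startup cost, one option is to force context creation explicitly and time it separately from the kernel. The following is a minimal sketch, not a definitive benchmark. It assumes that calling `cudaFree(0)` triggers context setup (a commonly used idiom, not something stated in the question), and the helper `msSince` is a hypothetical name introduced here for illustration. Timings go to stderr so they remain visible when stdout is redirected to /dev/null as in the question.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloFromGPU()
{
    printf("Hello World from GPU!\n");
}

// Hypothetical helper: milliseconds elapsed on the host since `start`.
static double msSince(std::chrono::steady_clock::time_point start)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main()
{
    // Assumption: cudaFree(0) does no real work but forces the runtime
    // to create the GPU context, so this interval is the setup cost.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    double initMs = msSince(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the context already in place, time the launch and the wait
    // for the kernel to finish. The first launch may still include some
    // one-time costs of its own.
    auto t1 = std::chrono::steady_clock::now();
    helloFromGPU<<<1, 1>>>();
    err = cudaDeviceSynchronize();
    double kernelMs = msSince(t1);
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    fprintf(stderr, "context setup:        %.1f ms\n", initMs);
    fprintf(stderr, "kernel launch + sync: %.1f ms\n", kernelMs);
    return 0;
}
```

On an older toolkit such as CUDA 7.0 this likely needs `nvcc -std=c++11` because of `<chrono>`. If the explanation above holds, nearly all of the elapsed time should show up in the first interval rather than the second.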
