I want to check whether cudaMalloc and cudaFree are synchronous calls, so I made some changes to the simpleMultiGPU.cu example in the CUDA SDK. The following is the part I changed (the added lines are the float *dd[GPU_N] declaration, the initial cudaMalloc loop, and the cudaMalloc/cudaFree pair inside the timed loop):
float *dd[GPU_N];

for (i = 0; i < GPU_N; i++)
{
    cudaSetDevice(i);
    cudaMalloc((void **)&dd[i], sizeof(float));
}

//Start timing and compute on GPU(s)
printf("Computing with %d GPUs...\n", GPU_N);
StartTimer();

//Copy data to GPU, launch the kernel and copy data back. All asynchronously
for (i = 0; i < GPU_N; i++)
{
    //Set device
    checkCudaErrors(cudaSetDevice(i));

    //Copy input data from CPU
    checkCudaErrors(cudaMemcpyAsync(plan[i].d_Data, plan[i].h_Data,
                                    plan[i].dataN * sizeof(float),
                                    cudaMemcpyHostToDevice, plan[i].stream));

    //Perform GPU computations
    reduceKernel<<<BLOCK_N, THREAD_N, 0, plan[i].stream>>>(plan[i].d_Sum, plan[i].d_Data, plan[i].dataN);
    getLastCudaError("reduceKernel() execution failed.\n");

    //Read back GPU results
    checkCudaErrors(cudaMemcpyAsync(plan[i].h_Sum_from_device, plan[i].d_Sum,
                                    ACCUM_N * sizeof(float),
                                    cudaMemcpyDeviceToHost, plan[i].stream));

    cudaMalloc((void **)&dd[i], sizeof(float));
    cudaFree(dd[i]);
    //cudaStreamSynchronize(plan[i].stream);
}
By commenting out the cudaMalloc line and the cudaFree line, respectively, inside the main loop, I found that on a system with 2 GPUs the GPU processing time is 30 milliseconds and 20 milliseconds, respectively. From this I concluded that cudaMalloc is an asynchronous call and cudaFree is a synchronous call. I am not sure whether this is true, nor what the rationale for it in the design of the CUDA architecture would be. My device's compute capability is 2.0, and I have tried both CUDA 4.0 and CUDA 5.0.
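One way to isolate this behavior outside the SDK sample would be a minimal single-GPU test like the following: launch a long-running kernel asynchronously, then measure how long the host blocks inside cudaMalloc or cudaFree; a call that implicitly synchronizes will not return until the kernel finishes. This is only a sketch of that idea; the spin kernel, the 500-million-cycle duration, and the std::chrono host timer (which needs a C++11 host compiler) are illustrative choices, not part of simpleMultiGPU.cu:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait kernel: spins for roughly `cycles` GPU clock ticks so the
// device stays busy while the host makes the call under test.
// clock64() requires compute capability 2.0 or higher.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Milliseconds elapsed on the host since t0.
static double elapsedMs(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    float *d_tmp = NULL;

    cudaFree(0); // force context creation so it does not skew the timings

    // Launch a long-running kernel; the launch itself returns immediately.
    spin<<<1, 1>>>(500000000LL);

    // Time cudaMalloc while the kernel is still running. A sub-millisecond
    // result means it did not wait for the kernel to finish.
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    cudaMalloc((void **)&d_tmp, sizeof(float));
    printf("cudaMalloc blocked for %.3f ms\n", elapsedMs(t0));

    // Keep the device busy again so cudaFree is also measured against
    // outstanding work.
    cudaDeviceSynchronize();
    spin<<<1, 1>>>(500000000LL);

    // If cudaFree implicitly synchronizes, this blocks for roughly the
    // kernel's run time instead of returning immediately.
    t0 = std::chrono::steady_clock::now();
    cudaFree(d_tmp);
    printf("cudaFree blocked for %.3f ms\n", elapsedMs(t0));

    cudaDeviceSynchronize();
    return 0;
}

Compiled with nvcc -arch=sm_20 (matching compute capability 2.0), comparing the two printed times against the kernel's run time would show directly which call waited for the device.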