Let me answer your last question first:
Would copying only 1 integer after every kernel run slow my program down?
- . , GPU .. .. (1 int vs 100 ints), , . , . , , ( )
?
, . : cudaMemcpy. , , , - . .
, cudaThreadSynchronize(), , . .
cudaMemcpyAsync, , GPU cudaMemcpyAsync, , .
, , , , . , . - , , CUDA .