My task is to see whether an algorithm I developed can be sped up by running its calculations on the GPU rather than the CPU. I am new to accelerator computing; I was given the book “C++ AMP”, which I read cover to cover, and I thought I understood it reasonably well (I coded in C and C++ in the past, but these days it is mainly C#).
However, when it comes to a real application, I just don't seem to get it. So please help me if you can.
Let's say I have to compute some complicated function that takes a huge matrix as input (e.g. 50,000 x 50,000) plus some other data, and outputs a matrix of the same size. The full calculation over the whole matrix takes several hours.
On the CPU, I just cut the work into a number of pieces (around 100 or so) and run them using Parallel.For, or with a simple task-managing loop I wrote myself: keep several threads running (number of threads = number of cores), start a new piece whenever a thread finishes, until all pieces are done. And it worked well!
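In code, the CPU-side chunking looks roughly like this (a simplified sketch in C++, using PPL's concurrency::parallel_for in place of the C# Parallel.For; heavy_function, compute_chunk and the row split are just placeholders for the real calculation):

```cpp
#include <algorithm>
#include <cmath>
#include <ppl.h>       // concurrency::parallel_for (Microsoft PPL)
#include <vector>

// Stand-in for the real per-element work, which is far heavier.
static float heavy_function(float x) { return std::sqrt(x) * 0.5f + x; }

// Process rows [firstRow, lastRow) of a size x size matrix stored row-major.
static void compute_chunk(std::vector<float>& m, int size, int firstRow, int lastRow)
{
    for (int r = firstRow; r < lastRow; ++r)
        for (int c = 0; c < size; ++c)
            m[r * size + c] = heavy_function(m[r * size + c]);
}

void run_on_cpu(std::vector<float>& m, int size)
{
    const int chunkCount   = 100;   // roughly 100 pieces, as described above
    const int rowsPerChunk = (size + chunkCount - 1) / chunkCount;

    // PPL keeps one worker per core busy and hands out the next chunk as soon
    // as a thread finishes -- the same scheme as Parallel.For on the CPU.
    concurrency::parallel_for(0, chunkCount, [&](int chunk)
    {
        int first = chunk * rowsPerChunk;
        int last  = std::min(size, first + rowsPerChunk);
        compute_chunk(m, size, first, last);
    });
}
```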
On the GPU, however, I cannot use the same approach, not only because of memory limitations (that is fine, the data can be split into several parts), but because anything that runs for more than 2 seconds is considered a "timeout" and the GPU gets reset! So I have to make sure every part of my calculation takes less than 2 seconds to run.
And splitting the overall job is not the problem (cutting it into, say, 60 one-minute parts would be fine with me): the issue is that every single kernel launch (each parallel_for_each) has to complete within those 2 seconds, or the GPU gets reset.
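As I understand it, that means structuring the GPU work something like the sketch below: one parallel_for_each per band of rows, waiting for each band before launching the next, so that no single launch exceeds the limit (run_in_bands, heavy_function and the band split are made-up placeholders, not my real algorithm):

```cpp
#include <algorithm>
#include <amp.h>
using namespace concurrency;

// Placeholder per-element function; the real calculation is much heavier.
inline float heavy_function(float x) restrict(amp) { return x * 0.5f + 1.0f; }

// Process the matrix in horizontal bands so that each parallel_for_each
// (each kernel launch) finishes well under the ~2 second limit.
void run_in_bands(array_view<float, 2> data, int rowsPerBand)
{
    const int rows = data.extent[0];
    const int cols = data.extent[1];

    for (int first = 0; first < rows; first += rowsPerBand)
    {
        const int bandRows = std::min(rowsPerBand, rows - first);

        // A view onto just this band; only the band's extent gets launched.
        array_view<float, 2> band = data.section(index<2>(first, 0),
                                                 extent<2>(bandRows, cols));

        parallel_for_each(band.extent, [=](index<2> idx) restrict(amp)
        {
            band[idx] = heavy_function(band[idx]);
        });

        // Copy results back and wait, so the next launch only starts after
        // this one has completed.
        band.synchronize();
    }
}
```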
So is the GPU even suited to this kind of work? The samples (like N-Body) show speedups of around 100x (the CPU getting maybe 2 gflops or w/e while AMP gets 200 gflops), but I do not see how to get anything like that for a computation that runs for hours!
Do I really have to chop the work up into a huge number of tiny pieces, something like 10 to 100 ms per parallel_for_each call?
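For reference, a single launch could be timed roughly like this to see how big a piece fits under the limit (again just a sketch, with trivial stand-in work instead of the real function):

```cpp
#include <amp.h>
#include <chrono>
using namespace concurrency;

// Time one kernel launch over a trial band, including the copy-back.
// Note: the very first launch also pays JIT/warm-up cost, so it is worth
// timing a second run as well.
double time_one_band(array_view<float, 2> band)
{
    const auto t0 = std::chrono::steady_clock::now();

    parallel_for_each(band.extent, [=](index<2> idx) restrict(amp)
    {
        band[idx] = band[idx] * 0.5f + 1.0f;   // stand-in for the real work
    });
    band.synchronize();                         // wait for the kernel to finish

    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```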
Or am I missing something, and there is a better way to do this?
Thanks!