I am not an expert on JPEG and compression, but since your problem is about as I/O-limited as it gets (assuming you can do the rotation without the heavy computation involved in decoding and re-encoding), you may not be able to speed it up much on the GPU you have. (Un)fortunately, your reference point is a fairly slow Atom processor.
I assume that the Radeon has its own separate memory. This means that the data has to be transferred over PCI-E, which adds latency compared to running on the CPU, and unless you hide that latency you can be sure it is the bottleneck. This is the most likely reason why your code that uses OpenCV on the GPU is slow (besides the fact that you are doing two memory-bound operations, transpose and flip, instead of one).
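To illustrate the "one operation instead of two" point, here is a minimal CPU-side sketch of a 90-degree clockwise rotation done in a single pass, so every pixel is read once and written once. It is only an illustration of the index mapping, not OpenCV or GPU code; the function name `rotate90_cw` and the flat row-major layout are my assumptions.

```python
def rotate90_cw(pixels, width, height):
    """Rotate a row-major image 90 degrees clockwise in a single pass.

    pixels is a flat list of width * height values (hypothetical layout).
    Returns (rotated_pixels, new_width, new_height).
    """
    new_width = height  # rows of the source become columns of the result
    out = [None] * (width * height)
    for y in range(height):
        for x in range(width):
            # Source pixel (x, y) lands in row x, column (height - 1 - y).
            out[x * new_width + (height - 1 - y)] = pixels[y * width + x]
    return out, new_width, width
```

A GPU kernel would use the same mapping with one work-item per pixel; a separate transpose followed by a flip touches the whole image in memory twice.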
The main thing is to hide as much of the PCI-E transfer time as possible with multiple buffering. Overlapping transfers both to and from the GPU with computation, using the full-duplex capability of PCI-E, will work only if the card has two DMA engines, as high-end Radeons or NVIDIA Quadro/Tesla cards do, which I really doubt is the case here.
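The structure of multiple buffering can be sketched roughly as below. This is not OpenCL code: `upload`, `compute` and `download` are hypothetical stand-ins for the host-to-device copy, the rotation kernel and the device-to-host copy, and the bounded queues play the role of the buffers in flight (`depth=2` corresponds to double buffering). In real code you would get the same effect with non-blocking enqueue calls and events.

```python
import queue
import threading


def run_pipeline(frames, upload, compute, download, depth=2):
    """Push frames through upload -> compute -> download with overlap.

    Each stage runs concurrently; at most `depth` items wait between
    two stages, which models a fixed number of device buffers.
    Error handling is omitted: a stage that raises will stall the sketch.
    """
    done = object()  # sentinel marking the end of the stream
    uploaded = queue.Queue(maxsize=depth)
    computed = queue.Queue(maxsize=depth)

    def upload_stage():
        for frame in frames:
            uploaded.put(upload(frame))  # blocks while all buffers are busy
        uploaded.put(done)

    def compute_stage():
        while True:
            item = uploaded.get()
            if item is done:
                computed.put(done)
                return
            computed.put(compute(item))

    workers = [
        threading.Thread(target=upload_stage),
        threading.Thread(target=compute_stage),
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:  # the calling thread acts as the download stage
        item = computed.get()
        if item is done:
            break
        results.append(download(item))

    for worker in workers:
        worker.join()
    return results
```

For example, `run_pipeline(images, to_gpu, rotate, from_gpu)` (all four names hypothetical) would start uploading the next image while the previous one is still being rotated or read back. Output order is preserved because every stage is a single FIFO consumer.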
If the GPU compute time (the time it takes the GPU to do the rotation) is lower than the transfer time, you won't be able to hide the transfer completely. The HD 4530 has a rather slow memory interface, with a peak of only 12.8 GB/s, and the rotation kernel should be fully memory-bound. I can only guesstimate, but I would say that if you reach the peak PCI-E transfer rate of ~1.5 GB/s (4x PCI-E, AFAIK), the compute kernel will still be several times faster than the transfer, so you will be able to overlap only a little. You can simply time the parts separately, without writing elaborate asynchronous code, and estimate how fast things could get with optimal overlap.
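Once you have timed upload, kernel and download separately, the estimate is simple arithmetic. A rough sketch, assuming an idealized steady-state pipeline where the busiest resource (the PCI-E link or the GPU) sets the pace; the function name and the `full_duplex` flag are my own, and real overlap will be somewhat worse than this bound.

```python
def estimate_per_frame_time(t_upload, t_compute, t_download, full_duplex=False):
    """Return (serial, overlapped) per-frame times, in the unit of the inputs.

    serial:     nothing overlaps, the three steps simply add up.
    overlapped: idealized lower bound with perfect multiple buffering.
    """
    serial = t_upload + t_compute + t_download
    if full_duplex:
        # Two DMA engines: upload and download can run at the same time.
        transfer = max(t_upload, t_download)
    else:
        # One DMA engine: both copies take turns on the link.
        transfer = t_upload + t_download
    # The slower of "all transfers" and "all compute" limits throughput.
    overlapped = max(transfer, t_compute)
    return serial, overlapped
```

With made-up timings of 4 ms per copy and 1 ms for the kernel, `estimate_per_frame_time(4.0, 1.0, 4.0)` gives `(9.0, 8.0)`: with a single DMA engine, overlap saves only the kernel time. With `full_duplex=True` the bound drops to 4.0 ms, which is why the dual-DMA question matters.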
One thing you might want to consider is getting hardware on which PCI-E is not a bottleneck, for example:
- an AMD APU-based system. On these platforms, you can page-lock the memory and use it directly from the GPU;
- integrated GPUs that share the main memory with the host;
- a fast low-power CPU such as a mobile Ivy Bridge part, e.g. the i5-3427U, which consumes almost as little power as the Atom D525 but has AVX support and should be several times faster.
pszilard