RenderScript: 10x speedup when forcing the default CPU implementation

I have implemented a CNN in RenderScript, as described in the previous question that spawned this one. Basically, when

adb shell setprop debug.rs.default-CPU-driver 1 

is run at startup, there is a 10x speedup on both the Nvidia Shield and the Nexus 7. The average computation time goes from around 50 ms down to 5 ms; the test app goes from around 50 to 130 or more. There are two convolution implementations:

(1) a direct sliding-kernel convolution
(2) im2col followed by GEMM from ScriptIntrinsicBLAS.

Both show a similar speedup. The question is: why does this happen, and can this effect be triggered from code in a predictable way? Is detailed information about this available anywhere?
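
For reference, the GEMM half of approach (2) boils down to a single ScriptIntrinsicBLAS call on 2D float Allocations. A minimal sketch follows; mRS is the RenderScript context, and the dimensions M, K, N and the layout are illustrative rather than my actual code:

    import android.renderscript.Allocation;
    import android.renderscript.Element;
    import android.renderscript.RenderScript;
    import android.renderscript.ScriptIntrinsicBLAS;
    import android.renderscript.Type;

    int M = 64, K = 27, N = 10000;  // illustrative sizes only
    ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);

    // Filters A (M x K) times im2col'ed input B (K x N) gives output C (M x N).
    // In a Type, setX() is the number of columns and setY() the number of rows.
    Allocation A = Allocation.createTyped(mRS,
            new Type.Builder(mRS, Element.F32(mRS)).setX(K).setY(M).create());
    Allocation B = Allocation.createTyped(mRS,
            new Type.Builder(mRS, Element.F32(mRS)).setX(N).setY(K).create());
    Allocation C = Allocation.createTyped(mRS,
            new Type.Builder(mRS, Element.F32(mRS)).setX(N).setY(M).create());

    // C = 1.0f * A * B + 0.0f * C
    blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
            1.0f, A, B, 0.0f, C);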

Edit:

Following the recommendations below, I checked the use of finish() and copyTo(); here is the breakdown of the measured code. The accelerated timing reported above occurs with copyTo() called but without finish(); uncommenting finish() adds about 1 ms to the measured time.

    double forwardTime = 0;
    long t = System.currentTimeMillis();
    //double t = SystemClock.elapsedRealtime(); // makes no difference
    for (Layer a : layers) {
        blob = a.forward(blob);
    }
    //mRS.finish(); // adds about 1ms to measured time
    blob.copyTo(outbuf);
    forwardTime = System.currentTimeMillis() - t;

This may be unrelated, but an error message appears at startup on the NVIDIA Shield, which disappears once debug.rs.default-CPU-driver 1 has been set via adb shell setprop:

 E/Renderscript: rsAssert failed: 0, in vendor/nvidia/tegra/compute/rs/driver/nv/rsdNvBcc.cpp 

I am currently setting compileSdkVersion, minSdkVersion and targetSdkVersion to 23, with buildToolsVersion "23.0.2". The tablets are fully updated to the latest Android version. I am not sure what minimum target I need to set in order to have ScriptIntrinsicBLAS available.

I use #pragma rs_fp_relaxed in all the scripts. The Allocations all use default usage flags.
This question describes a similar situation, but it turned out that the OP was creating new Script objects on every computation round. I do nothing of the sort; all Scripts and Allocations are created at init time.

2 answers

The mRS.finish() call appears to be commented out in the original post. I wonder if that is the issue here.

To benchmark RenderScript properly, we need to wait for pending asynchronous kernel launches. There are generally two ways to do that:

  • Use RenderScript.finish(). This works well with debug.rs.default-CPU-driver 1, and it also works with most GPU drivers. However, certain GPU drivers treat it as a NOOP.
  • Use Allocation.copyTo() or other similar APIs to access the data of an Allocation, preferably the final output Allocation. This is really a trick, but it works on all devices. Just be aware that the copyTo operation itself may take some time; make sure you account for that (see the sketch after this list).
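
For instance, using the names mRS, layers, blob and outbuf from the code in the question, a sketch that times the kernels and the copy separately might look like this:

    long t = System.currentTimeMillis();
    for (Layer a : layers) {
        blob = a.forward(blob);  // enqueues asynchronous RS kernels
    }
    mRS.finish();                // waits for pending kernels; a NOOP on some GPU drivers
    long kernelTime = System.currentTimeMillis() - t;

    long t2 = System.currentTimeMillis();
    blob.copyTo(outbuf);         // guaranteed to wait on all devices, but adds the copy cost
    long copyTime = System.currentTimeMillis() - t2;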

5 ms seems suspicious here; it might be real depending on the algorithm. But it is worth checking whether it still holds once you add the finish() or copyTo().


Very strange indeed. The fact that you get the same result on both devices and with two very different implementations of the conv layers suggests that something else is going on with the benchmarking or timing itself, rather than with differences in CPU/GPU execution, since such results are rarely this clear-cut.

I would suggest verifying that the outputs from copyTo() are always the same. Set up logcat output of, say, the first (and last!) 10 values of the float array returned from each layer's Allocation, to make sure that all implementations and execution modes are really processing the data correctly and identically at every layer.
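
Something along these lines, for example (blob is the Allocation name from your edit; TAG is an assumed logging tag):

    import java.util.Arrays;
    import android.util.Log;

    float[] out = new float[blob.getType().getCount()];
    blob.copyTo(out);
    Log.d(TAG, "first 10: " + Arrays.toString(Arrays.copyOfRange(out, 0, 10)));
    Log.d(TAG, "last 10: " + Arrays.toString(Arrays.copyOfRange(out, out.length - 10, out.length)));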

Depending on your setup, it is also possible that the data-copying overhead mentioned above dominates your measured times, and what you see is just an unfortunate side effect of that, since copying data from one place or another may take more or less time. Try increasing the size and number of conv layers (with dummy/random values, just for testing) to make the computation heavier, shifting the balance between compute time and data-transfer time, and see how that affects your results.

If all else fails, it may simply be that the GPU really does take longer for some reason, although it can be hard to pin down why. Some things to check: what data types and sizes are you using for your data? How do you read/write the data to the Allocations? Are you already using #pragma rs_fp_relaxed to relax your float precision? What usage flags do you set when creating the Allocations (for example, Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE)?
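
To illustrate the last point, usage flags are passed when an Allocation is created; if none are given, the default is USAGE_SCRIPT. A quick sketch (mRS, width and height are assumed names):

    Type t = new Type.Builder(mRS, Element.F32(mRS)).setX(width).setY(height).create();

    // Script-only access; this is also what the default createTyped() overload uses:
    Allocation a = Allocation.createTyped(mRS, t, Allocation.USAGE_SCRIPT);

    // Shared with the graphics pipeline as a texture:
    Allocation tex = Allocation.createTyped(mRS, t,
            Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE);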

As for your last question, detailed RS documentation on specific optimization issues is still very scarce... I think asking here on SO is still one of the best resources available :)

