How to properly profile Android RenderScript code on Nvidia Shield

I have implemented a small CNN in RenderScript and want to profile its performance on different hardware. On my Nexus 7 the timings make sense, but on the NVIDIA Shield they do not.

The CNN (LeNet) is implemented as 9 layers in a queue, and the computation is performed sequentially. Each layer is timed individually.

Here is an example:

    (all times in ms)
              conv1   pool1   conv2   pool2   resh1   ip1     relu1   ip2     softmax
    nexus7    11.177  7.813   13.357  8.367   8.097   2.1     0.326   1.557   2.667
    shield    13.219  1.024   1.567   1.081   0.988   14.588  13.323  14.318  40.347

On the Nexus 7 the time distribution looks roughly as expected, with conv1 and conv2 (the convolution layers) taking most of the time. But on the Shield the times for layers 2-4 drop lower than seems reasonable, and the cost appears to get pushed towards the end. The softmax layer is a relatively small job, so 40 ms is far too long. My timing method must be faulty, or something else is going on.

The code executing the layers looks something like this:

    double[] times = new double[layers.size()];
    int layerindex = 0;
    for (Layer a : layers) {
        double t = SystemClock.elapsedRealtime();
        //long t = System.currentTimeMillis(); // makes no difference

        blob = a.forward(blob); // here we call RenderScript forEach_(), invoke_() etc.

        //mRS.finish(); // makes no difference

        t = SystemClock.elapsedRealtime() - t;
        //t = System.currentTimeMillis() - t; // makes no difference

        times[layerindex] += t; // later we take the average etc.
        layerindex++;
    }

My understanding is that once forEach_() returns, the work is supposed to be finished. In any case, mRS.finish() should provide a final barrier. But looking at the timings, the only reasonable explanation is that jobs are still being processed in the background.

The application is very simple, I just run the test from MainActivity and print the results to logcat. Android Studio builds the app as a release and runs it on the device over USB.

(1) What is the correct way to time RenderScript execution? (2) Is it true that when forEach_() returns, the threads spawned by the script are guaranteed to have finished? (3) In my test app I run everything directly from MainActivity. Is that a problem (other than blocking the UI thread and the app becoming unresponsive)? If it affects the timings or is otherwise wrong, what is the proper way to set up a test app like this?

2 answers

I have implemented CNNs in RenderScript myself, and as you describe, it requires chaining several kernels and calling forEach_*() multiple times, once per layer, if each layer is implemented as a separate kernel. So I can tell you from experience that returning from the forEach call does not really guarantee that the work has completed. In theory it only schedules the kernel, and all queued requests will actually run whenever the system decides it is best to, especially if they are processed on the tablet's GPU.

Usually the only way to make sure you actually have control over when a kernel runs is to explicitly read the output of the RS kernel between layers, for example with .copyTo() on the kernel's output Allocation object. This "forces" any pending RS jobs that this Allocation depends on to be executed at that point. Granted, this introduces data-transfer overhead and your timings will not be fully accurate -- in fact, the total execution time of the network will probably be lower than the sum of the individually timed layers measured this way. But as far as I know it is the only reliable way to time the individual kernels in a chain, and it will give you enough feedback to find where the bottlenecks are and to guide your optimization, if that is what you are after.
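
To make that concrete, here is a minimal sketch of how the timing loop from the question could be adapted. It assumes each layer exposes its output through a hypothetical getOutputAllocation() accessor and that the output element type is F32; the key point is that the copyTo() sits inside the timed region, so every pending kernel that the output depends on is forced to run before the clock is stopped:

    double[] times = new double[layers.size()];
    int layerindex = 0;
    for (Layer a : layers) {
        double t = SystemClock.elapsedRealtime();

        blob = a.forward(blob); // schedules forEach_(), invoke_() etc.

        // Hypothetical accessor: the Allocation that a.forward() wrote its result into.
        Allocation out = a.getOutputAllocation();
        float[] host = new float[out.getType().getCount()]; // assumes an F32 Allocation
        out.copyTo(host); // blocks until all kernels this Allocation depends on have run

        t = SystemClock.elapsedRealtime() - t;
        times[layerindex] += t;
        layerindex++;
    }

The copies add overhead of their own, so treat these per-layer numbers as relative indicators of where the time goes rather than as exact kernel durations.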


Maybe a little off topic: but for a CNN, if you can structure your algorithm around matrix-matrix multiplication as the main computational unit, you can actually use the RenderScript IntrinsicBLAS intrinsics, in particular BNNM and SGEMM.

Pros:

  • High-performance 8-bit matrix multiplication (BNNM), available in the N Preview.
  • Back compatibility to Android 2.3 via the RenderScript Support Lib, when used with Build-Tools 24.0.0 rc3 or higher.
  • High-performance SGEMM GPU acceleration on the Nexus 5X and 6P with the N Preview build NPC91K.
  • If you only use RenderScript intrinsics, you can code everything in Java.

Minuses:

  • Your algorithm may need to be restructured so that it is based on 2D matrix multiplication (see the sketch after this list).
  • Although BNNM is available in Android 6.0, its performance on 6.0 is not satisfactory, so it is better to use the Support Lib for BNNM and set targetSdkVersion to 24.
  • SGEMM GPU acceleration is currently only available on the Nexus 5X and Nexus 6P, and it currently requires the width and height of the matrices to be a multiple of 8.
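
To illustrate the restructuring mentioned above: a convolution layer can be lowered to one large matrix multiplication by unrolling input patches into columns ("im2col"). The sketch below only shows how the shapes line up; the helper name, the channel-height-width layout, stride 1 and zero padding are assumptions for the example, not part of any RenderScript API:

    // Shapes after lowering a convolution to GEMM (assumed layout, stride 1, no padding):
    //   filters:  numFilters x (inChannels * kH * kW)    -> matrix A (m x k)
    //   patches:  (inChannels * kH * kW) x (outH * outW) -> matrix B (k x n)
    //   output:   numFilters x (outH * outW)             -> matrix C (m x n)
    static float[][] im2col(float[][][] input, int kH, int kW) {
        int c = input.length, h = input[0].length, w = input[0][0].length;
        int outH = h - kH + 1, outW = w - kW + 1;
        float[][] patches = new float[c * kH * kW][outH * outW];
        for (int ch = 0; ch < c; ch++)
            for (int i = 0; i < kH; i++)
                for (int j = 0; j < kW; j++)
                    for (int y = 0; y < outH; y++)
                        for (int x = 0; x < outW; x++)
                            patches[(ch * kH + i) * kW + j][y * outW + x] = input[ch][y + i][x + j];
        return patches;
    }
    // The convolution is then a single C = A * B, which is exactly what SGEMM
    // (or BNNM for quantized 8-bit data) computes in one call.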

It is worth a try if BLAS fits into your algorithm. And it is easy to use:

    import android.support.v8.renderscript.*;
    // if you are not using the support lib:
    // import android.renderscript.*;

    private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte,
                         int c_offset, RenderScript mRS) {
        Allocation A, B, C;
        Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
        Type a_type = builder.setX(k).setY(m).create();
        Type b_type = builder.setX(k).setY(n).create();
        Type c_type = builder.setX(n).setY(m).create();

        // If you are reusing the input Allocations, just create and cache them somewhere else.
        A = Allocation.createTyped(mRS, a_type);
        B = Allocation.createTyped(mRS, b_type);
        C = Allocation.createTyped(mRS, c_type);

        A.copyFrom(a_byte);
        B.copyFrom(b_byte);

        ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);

        // Computes: C = A * B.Transpose
        // (c_offset comes in as a parameter; a_offset and b_offset are fixed here.)
        int a_offset = 0;
        int b_offset = 0;
        int c_multiplier = 1;
        blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
    }
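
For example, a hypothetical call site for the method above, multiplying a 64 x 256 matrix A by the transpose of a 128 x 256 matrix B (so C is 64 x 128); the concrete dimensions and the c_offset value are just placeholders:

    // a_bytes holds A (64 x 256) and b_bytes holds B (128 x 256), both row-major unsigned 8-bit.
    // The offset and multiplier values depend on the quantization scheme of your model.
    runBNNM(64, 128, 256, a_bytes, b_bytes, /* c_offset */ 0, mRS);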

SGEMM is similar:

    ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);

    // Construct the Allocations A, B, C somewhere and make sure the dimensions match.
    // Computes: C = 1.0f * A * B + 0.0f * C
    float alpha = 1.0f;
    float beta = 0.0f;
    blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
               alpha, A, B, beta, C);
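
As a minimal sketch of what "construct the Allocations and make sure the dimensions match" could look like for C = A * B, with A of size m x k and B of size k x n (the variable names here are placeholders, not from the code above):

    Type.Builder f32 = new Type.Builder(mRS, Element.F32(mRS));
    Allocation A = Allocation.createTyped(mRS, f32.setX(k).setY(m).create()); // m x k
    Allocation B = Allocation.createTyped(mRS, f32.setX(n).setY(k).create()); // k x n
    Allocation C = Allocation.createTyped(mRS, f32.setX(n).setY(m).create()); // m x n
    A.copyFrom(a_float); // float[] of length m * k, row-major
    B.copyFrom(b_float); // float[] of length k * n, row-major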
