How to make the most of SIMD in OpenCL?

In the Beignet Optimization Guide (Beignet is an open-source OpenCL implementation targeting Intel GPUs), it says:

Workgroup Size must be greater than 16 and be a multiple of 16.

Since the two possible SIMD lane widths on Gen are 8 and 16, this rule must be followed so that no SIMD lanes are wasted.
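For concreteness, this is a minimal host-side sketch of how I follow that rule when enqueueing a 1-D kernel (the names queue , kernel , and n are placeholders for an already-initialized command queue, built kernel, and problem size; error checking is omitted):

    #include <CL/cl.h>

    /* Enqueue a 1-D kernel with a local size that is a multiple of 16,
       rounding the global size up so it stays a multiple of the local size. */
    void enqueue_multiple_of_16(cl_command_queue queue, cl_kernel kernel, size_t n)
    {
        size_t local  = 16;                                /* multiple of the SIMD width */
        size_t global = ((n + local - 1) / local) * local; /* round n up to a multiple of 16 */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);
    }

Because the global size is rounded up, the kernel itself has to guard against out-of-range work items, e.g. if (get_global_id(0) >= n) return; with n passed in as a kernel argument.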

The Compute Architecture of Intel Processor Graphics Gen7.5 document also mentions:

For Gen7.5 based products, each EU has seven threads for a total of 28 KB of general purpose register file (GRF).

...

On Gen7.5 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane.

In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus, for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for a SIMD-32 compile, SIMD-32 x 7 threads = 224 kernel instances may execute concurrently on a single EU.

If I understand correctly, taking the SIMD-16 x 7 threads = 112 kernel instances example: in order to run 112 kernel instances on one EU, the work-group size should be 16. The OpenCL compiler then folds 16 kernel instances into one 16-lane SIMD thread, does this 7 times for 7 work-groups, and runs them on a single EU?

Question 1: Is my understanding correct so far?

However, the OpenCL spec also provides vector data types, so it is possible to make full use of the SIMD-16 compute resources in an EU through conventional SIMD-style programming (as with NEON or SSE).
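For illustration, here is a sketch of the same trivial kernel written both ways ( scale_scalar and scale_vec16 are made-up names, and the float16 version assumes the buffer length is a multiple of 16):

    /* Implicit vectorization: one work item handles one float; the
       compiler packs 8 or 16 work items into a single SIMD thread. */
    __kernel void scale_scalar(__global const float *in,
                               __global float *out, float k)
    {
        size_t i = get_global_id(0);
        out[i] = k * in[i];
    }

    /* Explicit vectorization: one work item handles 16 floats via
       float16; the compiler may still compile this SIMD8 or SIMD16. */
    __kernel void scale_vec16(__global const float16 *in,
                              __global float16 *out, float k)
    {
        size_t i = get_global_id(0);
        out[i] = k * in[i];
    }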

Question 2: If that is the case, does using a vector-16 data type (for example, float16 ) already make explicit use of the SIMD-16 resources, and therefore remove the at-least-16-items-per-work-group restriction?

Question 3: If all of the above is true, how do the two approaches compare: 1) 112 work items folded by the OpenCL compiler into 7 SIMD-16 GPU threads; 2) 7 native threads coded to explicitly use vector-16 data types and SIMD-16 operations?

Tags: simd, opencl, gpgpu, spmd
1 Answer
  • Nearly. You are making the assumption that there is one thread per work-group (NB: the thread in this context is what CUDA calls a "warp"; in Intel GPU terms, a work item is a SIMD lane of a GPU thread). Without subgroups, there is no way to force a work-group to map to exactly one thread. For instance, if you choose a work-group size of 16, the compiler is still free to compile SIMD8 and spread the work-group across two SIMD8 threads. Keep in mind that the compiler chooses the SIMD width before it knows the work-group size ( clCompileProgram precedes clEnqueueNDRangeKernel ). The subgroups extension might allow you to force the SIMD width, but it is definitely not implemented on Gen7.5. (The sketch after this list shows one way to observe which width the compiler actually picked.)

  • OpenCL vector types are an optional explicit vectorization step on top of the implicit vectorization that already happens automatically. If you used float16 , for example, each work item would process 16 floats, but the compiler would still compile at least SIMD8. Hence each GPU thread would process 8 x 16 floats (in parallel, though). That might be a bit of overkill. Ideally we do not want to have to explicitly vectorize our CL code with OpenCL vector types, but it can sometimes be useful if the kernel is not doing enough work (kernels that are too short can be bad). Somewhere it is said that float4 is a good rule of thumb.

  • I think you mean 112 work items? And by "native threads", do you mean CPU threads or GPU threads?

    • If you mean CPU threads, the usual arguments about GPUs apply: GPUs do well when your program does not diverge much (all instances take similar paths) and when you use the data enough times to amortize the cost of transferring it to and from the GPU (arithmetic intensity).
    • If you mean GPU threads (the Gen SIMD8 or SIMD16 threads), there is currently no public way to program them directly (EDIT: except via the subgroups extension, which is not available on Gen7.5). If you could, it would be a trade-off similar to that of assembly language: the job is harder, and the compiler sometimes simply does a better job than we can, but when you are solving a specific problem and have better domain knowledge, you can usually do better with enough programming effort (until the hardware changes and the clever assumptions baked into your program become invalid).
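As referenced in the first point above, here is a sketch of how to observe the compiler's choice after the fact. It assumes kernel and device are an already-built cl_kernel and its cl_device_id ; on Gen, the preferred work-group size multiple reported by the runtime generally corresponds to the compiled SIMD width:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Query the preferred work-group size multiple for a built kernel;
       on Gen this generally reflects the SIMD width (8, 16, or 32)
       that the compiler chose. */
    void print_simd_width(cl_kernel kernel, cl_device_id device)
    {
        size_t multiple = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);
        printf("preferred work-group size multiple: %zu\n", multiple);
    }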
