The optimization guide for Beignet, an open-source OpenCL implementation for Intel GPUs, states:
Work-group size should be greater than 16 and be a multiple of 16.
The two possible SIMD widths on Gen are 8 and 16, so we must follow this rule to avoid wasting SIMD lanes.
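To make the "wasted SIMD lanes" point concrete, here is a small sketch of my own (not from the Beignet docs) that counts how many lanes sit idle when the work-group size is not a multiple of the SIMD width the compiler picks:

```python
import math

# Illustration only: lanes left idle in the last hardware thread of a
# work group, assuming the compiler packs work-items into SIMD-16 threads.
def wasted_lanes(work_group_size, simd_width=16):
    threads = math.ceil(work_group_size / simd_width)
    return threads * simd_width - work_group_size

# A work group of 64 fills 4 SIMD-16 threads exactly; a work group of 50
# needs 4 threads but leaves 14 lanes of the last one idle.
print(wasted_lanes(64))  # 0
print(wasted_lanes(50))  # 14
```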
Intel's compute architecture document for the Gen7.5 graphics platform also mentions:
For Gen7.5-based products, each EU has seven threads for a total of 28 KB of General Register File (GRF).
...
In the Gen7.5 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane.
In reality, each hardware thread executes SIMD-Width kernel instances concurrently. Thus, for a kernel compiled for SIMD-16, it is possible for SIMD-16 x 7 threads = 112 kernel instances to execute concurrently on one EU. Similarly, for SIMD-32, SIMD-32 x 7 threads = 224 kernel instances can run concurrently on the same EU.
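As a back-of-envelope check of the figures quoted above (assuming 7 hardware threads per EU, as on Gen7.5):

```python
# Each hardware thread runs SIMD-Width kernel instances in lockstep,
# so an EU can have simd_width * threads instances in flight.
THREADS_PER_EU = 7  # Gen7.5 figure quoted above

def concurrent_instances(simd_width, threads=THREADS_PER_EU):
    return simd_width * threads

print(concurrent_instances(16))  # 112
print(concurrent_instances(32))  # 224
```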
If I understand correctly, using the SIMD-16 x 7 threads = 112 kernel instances example: to run 112 kernel instances concurrently on one EU, the work-group size should be 16. The OpenCL compiler then maps 16 kernel instances onto the 16 SIMD lanes of one hardware thread, and does this 7 times for 7 work groups, running them all on one EU?
Question 1: Is my understanding correct?
However, the OpenCL spec also provides vector data types. So it is possible to fully utilize the SIMD-16 compute resources in an EU through conventional SIMD programming (as with NEON or SSE).
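As a concrete sketch of the two styles (kernel names are my own, and this is an illustration rather than anything from the Beignet docs), the same element-wise addition can be written per-work-item, leaving the compiler to pack 16 work-items into a SIMD-16 thread, or with an explicit `float16` vector type:

```c
// Implicit style: one scalar work-item per element; the compiler
// packs 16 work-items into one SIMD-16 hardware thread.
__kernel void add_scalar(__global const float *a,
                         __global const float *b,
                         __global float *out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}

// Explicit style: each work-item processes a float16 vector,
// spelling out the SIMD-16 operation in the kernel source itself.
__kernel void add_vec16(__global const float16 *a,
                        __global const float16 *b,
                        __global float16 *out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
```

(OpenCL C kernel fragments; they need a host program and an OpenCL runtime to build and run.)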
Question 2: If that is the case, then using a vector-16 data type already uses the SIMD-16 resources explicitly, and therefore the multiple-of-16 work-group size restriction no longer applies. Is that right?
Question 3: If all of the above is true, how do the two approaches compare: 1) 112 kernel instances mapped by the OpenCL compiler onto 7 SIMD-16 hardware threads; 2) 7 native threads coded to use vector-16 data types and SIMD-16 operations explicitly?