I am referring to the document intel_gen8_arch.
Several sections confuse my understanding of how the SIMD engine works.
5.3.2 SIMD FPUs: In each EU, the primary computation units are a pair of SIMD floating-point units (FPUs). Although called FPUs, they support both floating-point and integer computation. These units can SIMD-execute up to four 32-bit floating-point (or integer) operations, or up to eight 16-bit integer or 16-bit floating-point operations. Each SIMD FPU can complete simultaneous add and multiply (MAD) floating-point instructions every cycle. Thus each EU is capable of 16 32-bit floating-point operations per cycle: (add + mul) x 2 FPUs x SIMD-4.
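To check that I am reading the formula at the end of that paragraph correctly, here is a minimal sketch of the arithmetic. The function and parameter names are my own (hypothetical) and the default values are taken only from the quoted text.

```python
def peak_flops_per_cycle(ops_per_mad=2, fpus_per_eu=2, simd_width=4):
    """Peak 32-bit floating-point operations per cycle for one EU.

    Mirrors the quoted formula: (add + mul) x 2 FPUs x SIMD-4.
    ops_per_mad = 2 because a MAD counts as one add plus one multiply.
    All names here are illustrative, not part of any Intel API.
    """
    return ops_per_mad * fpus_per_eu * simd_width


if __name__ == "__main__":
    # Values from section 5.3.2 as quoted above.
    print(peak_flops_per_cycle())
```

With the quoted values this gives 2 x 2 x 4 = 16, which matches the figure in the document.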
The lines above clearly state the maximum number of floating-point operations per cycle for each EU.
First doubt: I think this applies to one hardware thread of the EU, not to the whole EU.
Section 5.3.5 says that on the Gen8 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-width number of kernel instances concurrently. Thus, for a SIMD-16 compile of a compute kernel, SIMD-16 x 7 threads = 112 kernel instances can execute concurrently on a single EU. Similarly, for a SIMD-32 compile of a compute kernel, 32 x 7 threads = 224 kernel instances can execute concurrently on a single EU.
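The two counts in that passage can be reproduced with a short sketch. Again, the function name is hypothetical and the 7 threads per EU is simply the number used in the quote.

```python
def kernel_instances_per_eu(simd_width, threads_per_eu=7):
    """Kernel instances in flight on one EU for a given compiled SIMD width.

    Each hardware thread runs simd_width kernel instances, one per SIMD
    lane, so the total is simd_width x threads_per_eu. The default of 7
    threads comes from the quoted section 5.3.5; the name is illustrative.
    """
    return simd_width * threads_per_eu


if __name__ == "__main__":
    for width in (16, 32):
        print(f"SIMD-{width}: {kernel_instances_per_eu(width)} kernel instances")
```

This only reproduces the multiplication in the quote (16 x 7 = 112 and 32 x 7 = 224). It says nothing about how those instances are scheduled onto the FPUs.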
This description seems to contradict section 5.3.2.
In particular: 1) Since it says that each EU hardware thread has two SIMD-4 units, how does SIMD-16 work? How do we get 224 kernel instances on 7 threads?
2) Also, how do we compile a kernel in SIMD-16 or SIMD-32 mode?