Intel Gen8 architecture: calculating total kernel instances per execution unit

I am referring to the intel_gen8_arch document.

A few of its sections leave me confused about how the SIMD engines work.

5.3.2 SIMD FPUs: In each EU, the primary computation units are a pair of SIMD floating-point units (FPUs). Although called FPUs, they support both floating-point and integer computation. These units can SIMD execute up to four 32-bit floating-point (or integer) operations, or SIMD execute up to eight 16-bit integer or 16-bit floating-point operations. Each SIMD FPU can complete simultaneous add and multiply (MAD) floating-point instructions every cycle. Thus each EU is capable of 16 32-bit floating-point operations per cycle: (add + mul) x 2 FPUs x SIMD-4.

The quoted passage clearly states the maximum number of floating-point operations each execution unit can perform per cycle.

My first doubt: I suspect this applies to a single hardware thread of the EU, not to the execution unit as a whole.

Section 5.3.5 says that on the Gen8 compute architecture, most SPMD programming models employ this style of code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for a SIMD-32 compile, 32 x 7 threads = 224 kernel instances could be executing concurrently on the same EU.

This description seems to contradict section 5.3.2.

In particular: 1) since the document says the EU hardware has two SIMD-4 units, how does SIMD-16 execution work? How do we achieve 224 concurrent kernel instances across 7 threads?

Also, how do we compile a kernel in SIMD-16 or SIMD-32 mode?

1 answer

Section 5.3.2 does indeed say that each EU can perform 16 32-bit operations per cycle. Each EU has two FPU pipes, each of which can carry out 4 operations per cycle.

2 pipes * 4 ops per pipe * 2 (since a MAD counts as add + mul) = 16 ops per cycle

There are 7 threads in the EU (see Figure 3), but each cycle the EU can only pick instructions from two of the 7 (ready) threads, one instruction for each pipe.

As mentioned above, think of a SIMD-16 instruction as 4 of these SIMD-4 operations. Hence it takes 4 cycles to complete. A SIMD-32 instruction will run for 8 cycles through the same SIMD-4 pipes. Therefore, regardless of the SIMD width, the machine's throughput is (theoretically) the same. A "wider" SIMD compile simply means you use more registers and fewer threads per workload.

There is no easy way to choose the compilation width of a kernel (SIMD-8, SIMD-16, or SIMD-32), and for most workloads you probably don't want to. However, there is an Intel extension your driver may support, cl_intel_subgroups, which gives you some control over the thread width. (You must annotate the kernel with a special attribute.) This can be useful if you want the SIMD lanes to exchange data with each other directly (without extra round-trips through SLM or global memory).
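A minimal device-code sketch of what that looks like, assuming the intel_reqd_sub_group_size attribute (from the companion cl_intel_required_subgroup_size extension) to pin the compile width, and intel_sub_group_shuffle from cl_intel_subgroups for the lane-to-lane exchange; the kernel name and argument layout are made up for illustration:

```c
/* OpenCL C device code (not host-compilable C) */
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

__attribute__((intel_reqd_sub_group_size(16))) /* request a SIMD-16 compile */
__kernel void rotate_lanes(__global const float *in, __global float *out)
{
    size_t gid  = get_global_id(0);
    uint   lane = get_sub_group_local_id();
    uint   size = get_sub_group_size();  /* 16 here */

    float v = in[gid];
    /* read the value held by the next SIMD lane, wrapping around,
       without going through SLM or global memory */
    out[gid] = intel_sub_group_shuffle(v, (lane + 1) % size);
}
```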

Also check out this presentation from IDF. Slides 80-87 illustrate how a compiled SIMD width (e.g., SIMD-32 or SIMD-16) maps onto the EUs.
