Will the Knights Landing CPU (Xeon Phi) speed up byte/word integer code?

Question

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support the "F" subset (much like SSE without SSE2, or AVX without AVX2), so mostly floating-point operations.

I am writing software that operates on bytes and words (8- and 16-bit) using SSE4.1 instructions through intrinsics.

I am unsure whether AVX-512F will include EVEX-encoded versions of all or most SSE4.1 instructions, and whether that means I can expect my SSE code to automatically gain EVEX-extended instructions and map onto all the new registers.

Wikipedia says this:

The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256-bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.

Unfortunately, this does not clarify whether compiling SSE4 code with AVX-512 enabled will give the same (impressive) speedup as compiling it for AVX2, which VEX-encodes the legacy instructions.

Does anyone know what happens when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Can one expect a speedup like the one AVX1's VEX encoding gives byte and word instructions?

+5

c avx512 byte sse4 xeon-phi

user1649948 Jun 08 '16 at 21:56

1 answer

user1649948 · Answer 1 · 2016-06-17T22:42:32+0000

Well, I think I have gathered enough information to put together a decent answer. Here it is.

What happens when native SSE2/4 code is run on Knights Landing (KNL)?

The code will run in the bottom quarter of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means it occupies only 1/4 to 1/8 of the total register space available to a core and runs in legacy mode.

What happens if the same code is recompiled with compiler flags for AVX-512F?

SSE2/4 code will be generated with the VEX prefix. This means that pshufb becomes vpshufb and works alongside other AVX code in ymm. Instructions will NOT be promoted to AVX-512's native EVEX encoding or be allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX512-VL, in which case they gain the ability to directly address (renamed) zmm registers. It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has shown similar throughput for half-width AVX2 (AVX-128) and full 256-bit AVX2 code in many cases.
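To make the pshufb/vpshufb point concrete, here is a minimal sketch of an SSE-level intrinsic whose encoding depends only on the build flags. The function names are hypothetical, and the `target` attribute plus `__builtin_cpu_supports` are GCC/Clang-specific assumptions used so the file builds without extra flags. The scalar fallback is only there so it runs on any x86 machine.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: reverse 16 bytes with a single byte shuffle. */
__attribute__((target("ssse3")))
static void reverse16_simd(const uint8_t *src, uint8_t *dst)
{
    const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* Same intrinsic either way: pshufb in a legacy SSE build,
       VEX-encoded vpshufb on xmm when AVX code generation is enabled. */
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, rev));
}

/* Portable fallback so the example runs without SSSE3. */
static void reverse16_scalar(const uint8_t *src, uint8_t *dst)
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[15 - i];
}

void reverse16(const uint8_t *src, uint8_t *dst)
{
    if (__builtin_cpu_supports("ssse3"))
        reverse16_simd(src, dst);
    else
        reverse16_scalar(src, dst);
}
```

Compiling this with `-O2 -S` and comparing the assembly with and without `-mavx` or `-mavx512f` should show the switch from pshufb to vpshufb, still on xmm registers. The source does not change, and nothing is widened to zmm.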

Most importantly, how do I get SSE2/4/AVX128 byte/word code running on AVX-512F?

You will have to load 128-bit chunks into xmm, sign- or zero-extend those bytes/words to 32 bits in zmm, and operate as if they had always been larger integers. Then, when finished, convert back to bytes/words.
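The load, widen, compute, narrow sequence can be sketched with AVX-512F intrinsics alone. This is an illustration under stated assumptions, not a tuned kernel. The function names and the "gain" operation are hypothetical, the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang, and the scalar fallback exists only so the example runs on CPUs without AVX-512F.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical kernel: dst[i] = min(255, (src[i] * gain) >> 8) for 16 bytes. */
__attribute__((target("avx512f")))
static void scale_u8_avx512f(const uint8_t *src, uint8_t *dst, uint32_t gain)
{
    __m128i bytes = _mm_loadu_si128((const __m128i *)src); /* 16 x u8 in xmm  */
    __m512i wide  = _mm512_cvtepu8_epi32(bytes);           /* zero-extend to 16 x u32 in zmm */
    wide = _mm512_mullo_epi32(wide, _mm512_set1_epi32((int)gain));
    wide = _mm512_srli_epi32(wide, 8);
    __m128i packed = _mm512_cvtusepi32_epi8(wide);         /* unsigned-saturate back to u8 */
    _mm_storeu_si128((__m128i *)dst, packed);
}

/* Portable fallback with the same semantics. */
static void scale_u8_scalar(const uint8_t *src, uint8_t *dst, uint32_t gain)
{
    for (int i = 0; i < 16; i++) {
        uint32_t v = ((uint32_t)src[i] * gain) >> 8;
        dst[i] = (uint8_t)(v > 255 ? 255 : v);
    }
}

void scale_u8_x16(const uint8_t *src, uint8_t *dst, uint32_t gain)
{
    if (__builtin_cpu_supports("avx512f"))
        scale_u8_avx512f(src, dst, gain);
    else
        scale_u8_scalar(src, dst, gain);
}
```

For signed data, `_mm512_cvtepi8_epi32` would do the sign extension instead, and 16-bit words can be widened from a 256-bit register with `_mm512_cvtepu16_epi32` or `_mm512_cvtepi16_epi32`.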

Is it fast?

According to material published on Larrabee (Knights Landing's prototype), type conversions of any integral width are free from xmm to zmm and vice versa, as long as registers are available. In addition, once the calculations are done, the 32-bit results can be truncated on the fly to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
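The truncate-on-store step corresponds to the down-converting store intrinsics in AVX-512F. Below is a minimal sketch with hypothetical function names. It again assumes GCC or Clang for the `target` attribute and the runtime check, and it includes a scalar fallback only so it runs everywhere. Whether the conversion is actually "free" on KNL is the claim above, not something this code demonstrates.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: write the low byte of the first n (n <= 16)
   32-bit lanes of src straight to dst, with no intermediate xmm. */
__attribute__((target("avx512f")))
static void store_low_bytes_avx512f(const int32_t *src, uint8_t *dst, unsigned n)
{
    __m512i v = _mm512_loadu_si512(src);           /* reads 16 x i32 */
    __mmask16 k = (__mmask16)((1u << n) - 1u);     /* lanes 0..n-1 enabled */
    _mm512_mask_cvtepi32_storeu_epi8(dst, k, v);   /* truncate and store, unaligned */
}

/* Portable fallback with the same semantics. */
static void store_low_bytes_scalar(const int32_t *src, uint8_t *dst, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        dst[i] = (uint8_t)src[i];
}

void store_low_bytes(const int32_t *src, uint8_t *dst, unsigned n)
{
    if (__builtin_cpu_supports("avx512f"))
        store_low_bytes_avx512f(src, dst, n);
    else
        store_low_bytes_scalar(src, dst, n);
}
```

Bytes of dst beyond the first n are left untouched by the masked store, so a tail shorter than 16 elements needs no separate code path. `_mm512_mask_cvtepi32_storeu_epi16` is the word-sized counterpart.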

On KNL, each core has 2 VPUs that seem to be able to talk to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction, presumably with reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or across lanes only with the 32-bit vpermd, which does not apply to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly in AVX-512 (sparing the costly and register-hungry use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc.), rewriting SSE2/4 byte/word code for AVX-512F may provide a reasonable performance boost after all. At least on KNL.

Don't worry, I will test this the moment I get my hands on one. ;-)
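As an illustration of the two-source permute and the mask registers working together, here is a sketch of a 32-entry table lookup that falls back to a default for out-of-range indices. The function names and the default-value behaviour are hypothetical choices made for the example. The `target` attribute and `__builtin_cpu_supports` assume GCC or Clang, and the scalar path is there only so the code runs without AVX-512F.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = table[idx[i]] if 0 <= idx[i] < 32, else dflt. */
__attribute__((target("avx512f")))
static void lookup32_avx512f(const int32_t *table, const int32_t *idx,
                             int32_t *out, int32_t dflt)
{
    __m512i lo = _mm512_loadu_si512(table);        /* entries 0..15  */
    __m512i hi = _mm512_loadu_si512(table + 16);   /* entries 16..31 */
    __m512i i  = _mm512_loadu_si512(idx);
    /* Two-source permute: index bit 4 picks lo or hi, bits 3:0 pick the lane. */
    __m512i hit = _mm512_permutex2var_epi32(lo, i, hi);
    /* Unsigned compare straight into a mask register, no blend mask vector. */
    __mmask16 in_range = _mm512_cmplt_epu32_mask(i, _mm512_set1_epi32(32));
    __m512i r = _mm512_mask_mov_epi32(_mm512_set1_epi32(dflt), in_range, hit);
    _mm512_storeu_si512(out, r);
}

/* Portable fallback with the same semantics. */
static void lookup32_scalar(const int32_t *table, const int32_t *idx,
                            int32_t *out, int32_t dflt)
{
    for (int i = 0; i < 16; i++)
        out[i] = ((uint32_t)idx[i] < 32u) ? table[idx[i]] : dflt;
}

void lookup32(const int32_t *table, const int32_t *idx, int32_t *out, int32_t dflt)
{
    if (__builtin_cpu_supports("avx512f"))
        lookup32_avx512f(table, idx, out, dflt);
    else
        lookup32_scalar(table, idx, out, dflt);
}
```

The unsigned comparison treats negative indices as large values, so a single compare covers both ends of the range. With AVX2 the same logic would typically need two vpermd lookups, a compare producing a full vector mask, and a blend.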
