X64 SSE Data Types

The AMD64 Architecture Programmer's Manual, Volume 1: Application Programming, p. 226, says this about SSE instructions:

The processor does not check the data type of instruction operands prior to executing instructions; it checks them only at the point of execution. For example, if the processor executes an arithmetic instruction that takes double-precision operands but is supplied (by MOVx instructions) with single-precision operands, the processor first converts the operands from single precision to double precision before performing the arithmetic operation, and the result is correct. However, the required conversion may degrade performance.

I do not understand this; I would have thought that the ymm registers simply contain 256 bits, which each instruction interprets according to its expected operand types. It would be up to you to make sure the correct types are present, and in the described scenario the processor would run at full speed and silently give the wrong answer.

What am I missing?

1 answer

The Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 5.1, says something similar about mixing integer and FP data types (but curiously nothing about mixing singles and doubles):

When writing SIMD code that works for both integer and floating-point data, use the subset of SIMD convert instructions or load/store instructions to ensure that the input operands in XMM registers contain data types that are properly defined to match the instruction in use.

Code sequences containing cross-typed usage produce the same result across different implementations but incur a significant performance penalty. Using SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched SIMD data in an XMM register is strongly discouraged.

The Intel® 64 and IA-32 Architectures Software Developer's Manual is a bit confusing:

The SSE and SSE2 extensions define typed operations on packed and scalar floating-point data types and on 128-bit SIMD integer data types, but IA-32 processors do not enforce this typing at the architectural level. They enforce it only at the microarchitectural level.

...

Pentium 4 and Intel Xeon processors execute these instructions without generating an invalid-opcode exception (#UD) and will produce the expected results in register XMM0 (that is, the high and low 64 bits of each register will be treated as double-precision floating-point values, and the processor will operate on them accordingly).

...

In this example, XORPS or PXOR can be used instead of XORPD and yield the same correct result. However, because of the type mismatch between the operand data type and the instruction data type, a latency penalty will be incurred due to how the instructions are implemented at the microarchitecture level.

Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However, if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction tries to use the data in the register.

Note that these latency penalties are not incurred when moving data from XMM registers to memory.

I don't really know what "they enforce it only at the microarchitectural level" means, except that it suggests the different "data types" are handled differently by the microarchitecture. I have a few guesses:

  • AIUI, x86 cores typically use register renaming because of the shortage of architectural registers. Perhaps they internally use different physical registers for integer/single/double operands, so that each can sit closer to the corresponding vector unit.
  • It also seems possible that FP numbers are represented internally in a different format (for example, with a wider exponent to avoid denormals) and are converted to the canonical bits only when necessary.
  • CPUs use "forwarding" or "bypass" networks so that execution units do not have to wait for data to be written back before a subsequent instruction uses it, typically saving a cycle or two. Perhaps this forwarding does not happen between the integer and FP units.
