What do SSE instructions optimize in practice, and how does the compiler decide to use them?

SSE and/or 3DNow! provide vector instructions, but what do they optimize in practice? For example, are 8-bit characters processed 4 at a time instead of 1 at a time? Is there any speedup for certain arithmetic operations? Does the word size (16 bits, 32 bits, 64 bits) have any effect?

Do all compilers use them when they are available?

Do I need to understand assembly to use SSE instructions? Does knowing about electronics and gate logic help in understanding this?

+4
4 answers

Background: SSE has both vector and scalar instructions. 3DNow! is dead.

Often the compiler can get a significant benefit from vectorization without any help from the programmer. With some programming and experimentation effort, you can often get close to hand-written assembly speed without writing any specific vector instructions yourself. See your compiler's vectorization guide for details.

There are a couple of portability trade-offs. If you code against GCC's vector extensions, you can target architectures other than Intel, such as PowerPC and ARM, but not other compilers. If you use Intel's intrinsics to make your C code closer to assembly, you can use other compilers, but not other architectures.

Knowing electronics will not help you. Studying the available instructions will.

+4

In general, you cannot rely on compilers to use vectorized instructions. Some do (the Intel C++ compiler does a reasonable job in many simple cases, and GCC also manages it with mixed success).

But the idea is simply to apply the same operation to four 32-bit words at once (or, in some cases, to two 64-bit values).

Thus, instead of a traditional add instruction that adds the values of two 32-bit registers, you can use a vectorized addition that operates on special 128-bit registers holding four 32-bit values each and adds all four pairs in a single operation.

+3

Duplicate of another question: Using SSE instructions

In short, SSE stands for Streaming SIMD Extensions, where SIMD = Single Instruction, Multiple Data. This is useful for performing one mathematical or logical operation on many values at once, as is typically done for matrix or vector math.

The compiler can target this instruction set as part of its optimizations (explore your compiler's /O options), but you usually have to restructure your code and write SSE code by hand, or use a library such as Intel Performance Primitives, to really take advantage of it.

+1

If you know what you are doing, you can get a huge performance boost. See, for example, here, where the author got a 6x speedup of his algorithm.

0
