Is _mm_broadcast_ss faster than _mm_set1_ps?

Question

Is this code

float a = ...; __m128 b = _mm_broadcast_ss(&a);

always faster than this code

float a = ...; __m128 b = _mm_set1_ps(a);

?

What if a is defined as static const float a = ... rather than float a = ... ?

+6

vectorization avx

Ben-uri Nov 04 '12 at 12:09

3 answers

If you target the AVX instruction set, gcc will use VBROADCASTSS to implement the _mm_set1_ps intrinsic. Clang, however, will use two instructions (VMOVSS + VPSHUFD).

+8

Marat dukhan Nov 04 '12 at 19:49

_mm_broadcast_ss has limitations imposed by the architecture that the SSE _mm intrinsics API largely hides. The most important difference is the following:

_mm_broadcast_ss is limited to loading values only from memory.

This means that if you explicitly use _mm_broadcast_ss where the source is not in memory, the result will most likely be less efficient than using _mm_set1_ps. This typically happens when loading immediate values (constants) or when using the result of a recent calculation. In those situations the compiler keeps the value in a register. To use that value for a broadcast, the compiler must first write it back to memory. Alternatively, pshufd can be used to splat directly from the register.

_mm_set1_ps is implementation-defined and is not mapped to one specific CPU instruction. This means the compiler can use any of several SSE instructions to perform the splat. A smart compiler with AVX support enabled should use vbroadcastss internally when appropriate, but that depends on the state of the compiler's AVX optimizer.
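The contrast between the two intrinsics when the value comes from a recent calculation can be sketched as follows. This is only an illustration: the function names scale_set1 and scale_broadcast are hypothetical, and the target("avx") attribute assumes GCC or Clang. What the compiler actually emits depends on its version and optimization level.

```c
#include <immintrin.h>

/* SSE only: x * y can stay in a register, and the compiler is free to
   choose the splat sequence (shuffle, broadcast, ...). */
void scale_set1(const float *in, float *out, float x, float y)
{
    __m128 k = _mm_set1_ps(x * y);
    _mm_storeu_ps(out, _mm_mul_ps(_mm_loadu_ps(in), k));
}

/* AVX: _mm_broadcast_ss takes a pointer, so the product needs an address.
   The compiler may have to store tmp to the stack before broadcasting it. */
__attribute__((target("avx")))
void scale_broadcast(const float *in, float *out, float x, float y)
{
    float tmp = x * y;
    __m128 k = _mm_broadcast_ss(&tmp);
    _mm_storeu_ps(out, _mm_mul_ps(_mm_loadu_ps(in), k));
}
```

Both functions compute the same four products. Comparing their disassembly (for example with -O2 -S) is the reliable way to see which form a given compiler handles better.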

If you are certain that you are loading from memory, for example when iterating over an array of data, then using the broadcast intrinsic directly is fine. If you have any doubt, I would recommend sticking with _mm_set1_ps.
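A minimal sketch of the "iterating over an array" case, where each scalar already lives in memory and the broadcast can load it directly. The helper name weighted_row_sum and its data layout (four floats per row) are assumptions made for illustration, and target("avx") again assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: acc[j] = sum over i of coeff[i] * rows[4*i + j],
   for j = 0..3. Each coeff[i] is broadcast straight from the array. */
__attribute__((target("avx")))
void weighted_row_sum(const float *coeff, const float *rows, size_t n,
                      float acc[4])
{
    __m128 sum = _mm_setzero_ps();
    for (size_t i = 0; i < n; ++i) {
        __m128 c = _mm_broadcast_ss(&coeff[i]);  /* source is in memory */
        __m128 r = _mm_loadu_ps(rows + 4 * i);   /* unaligned load of one row */
        sum = _mm_add_ps(sum, _mm_mul_ps(c, r));
    }
    _mm_storeu_ps(acc, sum);
}
```

Here the pointer argument is a natural fit, because no value has to be moved out of a register just to give it an address.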

And in the specific case of static const float, you absolutely do not want to use _mm_broadcast_ss().

+7

jstine Oct 27 '14 at 21:10

Jason r · Accepted Answer · 2012-11-04T13:31:54+0000

_mm_broadcast_ss is likely to be faster than _mm_set1_ps. The former translates into a single instruction (VBROADCASTSS), while the latter is emulated using several instructions (probably a MOVSS followed by a shuffle). However, _mm_broadcast_ss requires the AVX instruction set, while _mm_set1_ps only requires SSE.
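Because of the different instruction-set requirements, one way to use the broadcast only where it is available is a compile-time switch on the __AVX__ macro, which GCC and Clang define under -mavx and MSVC defines under /arch:AVX. The wrapper names splat and splat_to_array are hypothetical; this is a sketch, not a recommendation over simply calling _mm_set1_ps.

```c
#include <immintrin.h>

/* Hypothetical wrapper: broadcast *p to all four lanes, using the AVX
   memory broadcast when the build enables AVX, and the SSE splat otherwise. */
static inline __m128 splat(const float *p)
{
#ifdef __AVX__
    return _mm_broadcast_ss(p);
#else
    return _mm_set1_ps(*p);
#endif
}

/* Stores the splatted value so the result can be inspected as plain floats. */
void splat_to_array(const float *p, float out[4])
{
    _mm_storeu_ps(out, splat(p));
}
```

Either branch produces the same four-lane result, so callers do not need to know which instruction set the build targets.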
