The faster SSE integer unaligned load that's rarely used

Question


I would like to know more about the _mm_lddqu_si128 intrinsic (the lddqu instruction, available since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (the movdqu instruction, available since SSE2).

I only discovered _mm_lddqu_si128 today. The Intel Intrinsics Guide says

this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary

and the comment says it

will work better in certain circumstances, but will never be worse.

So why isn't it used more (SSE3 is a pretty low bar, since all Core2 processors have it)? Why may it perform better when the data crosses a cache line? Is lddqu only better on a certain subset of processors, e.g. before Nehalem?

I realize I could read through an Intel manual to find the answer, but I think this question may be of interest to other people.

+4

x86 sse intrinsics

Z boson Jul 14 '16 at 9:33


1 answer

Peter Cordes · Accepted Answer · 2016-07-14T20:23:00+0000

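lddqu used a different strategy than movdqu on P4, but runs identically on every other CPU that supports it. There is no particular downside (SSE3 instructions take no extra bytes of machine code and are fairly widely supported, even by AMD, at this point), but there is no upside at all unless you care about P4.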
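As a minimal sketch of how interchangeable the two intrinsics are at the source level: both take the same pointer type and return the same 16 bytes, so switching between them is a one-word change. The helper names loads_agree and check_all_offsets are made up for this illustration, and the target attribute is a GCC/Clang extension assumed here so the file builds without -msse3.

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 (movdqu) */
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 (lddqu)  */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load the same 16 bytes with both intrinsics
 * and report whether the results are byte-for-byte identical. */
__attribute__((target("sse3")))
static int loads_agree(const uint8_t *p)
{
    __m128i a = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    __m128i b = _mm_lddqu_si128((const __m128i *)p);  /* lddqu  */
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Hypothetical helper: try every starting offset in a 128-byte buffer,
 * so some of the loads straddle the 64-byte cache line boundary. */
int check_all_offsets(void)
{
    __attribute__((aligned(64))) uint8_t buf[128];
    for (int i = 0; i < 128; i++)
        buf[i] = (uint8_t)i;

    for (int off = 0; off <= 128 - 16; off++)
        if (!loads_agree(buf + off))
            return 0;
    return 1;
}
```

This only demonstrates that the results match. Any performance difference depends on the microarchitecture and cannot be seen from the code.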

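Dark Shikari (one of the lead developers of the x264 video encoder, responsible for many of its SSE speedups) went into detail about this in a blog post in 2008. The post is only available through archive.org, since the original is offline, but there is a lot of good material on his blog.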

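The most interesting point he makes is that Core2 still has slow unaligned loads, where manually doing two aligned loads and a palignr is faster, but palignr is only available with an immediate shift count. Since Core2 runs lddqu the same as movdqu, it doesn't help.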
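The two-aligned-loads-plus-palignr idea can be sketched as below. This is an illustration under assumptions, not code from x264: the fixed OFFSET of 5 and both function names are invented here, and the target attribute is a GCC/Clang extension. Because the shift count must be a compile-time immediate, a real implementation needs one code path per possible offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (palignr) */

#define OFFSET 5  /* hypothetical misalignment; must be a compile-time constant */

/* Return the 16 bytes starting at base + OFFSET using only aligned loads.
 * base must be 16-byte aligned and 32 bytes must be readable from it. */
__attribute__((target("ssse3")))
static __m128i load_at_offset(const uint8_t *base)
{
    assert(((uintptr_t)base & 15) == 0);
    __m128i lo = _mm_load_si128((const __m128i *)base);         /* bytes 0..15  */
    __m128i hi = _mm_load_si128((const __m128i *)(base + 16));  /* bytes 16..31 */
    /* Concatenate hi:lo, shift right by OFFSET bytes, keep the low 16 bytes. */
    return _mm_alignr_epi8(hi, lo, OFFSET);
}

/* Hypothetical self-check: compare against a plain byte copy. */
int alignr_matches_unaligned(void)
{
    __attribute__((aligned(16))) uint8_t buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = (uint8_t)(i * 3);

    __m128i got = load_at_offset(buf);
    return memcmp(&got, buf + OFFSET, 16) == 0;
}
```

Note that this reads a full 32-byte window even though only 16 bytes are wanted, so the caller has to guarantee that the whole window is valid memory.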

Apparently Core1 does implement lddqu specially, so it is not just P4 after all.

This Intel blog post about the history of lddqu/movdqu (which I found in 2 seconds by Googling lddqu vs movdqu, /scold @Zboson) explains:

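(on P4 only): The instruction works by loading a 32-byte block aligned on a 16-byte boundary and extracting the 16 bytes corresponding to the unaligned access.
Because the instruction loads more bytes than requested, some usage restrictions apply. Lddqu should be avoided on Uncached (UC) and Write-Combining (USWC) memory regions. Also, by its implementation, lddqu should be avoided in situations where store-load forwarding is expected.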

So I guess that explains why they didn't just use that strategy to implement movdqu all the time.

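I guess the decoders don't know the memory type, and that is where the decision has to be made about which uops to decode the instruction to. So trying to be "smart" about using the better strategy opportunistically on WB memory was probably not possible, even if it were desirable. (Which it isn't, because of store-forwarding.)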

The summary from that blog post:

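Starting from the Intel Core 2 brand (Core microarchitecture, from mid-2006, Merom CPU and later) and going forward: lddqu does the same thing as movdqu
In other words:
* if the CPU supports Supplemental Streaming SIMD Extensions 3 (SSSE3) → lddqu does the same thing as movdqu,
* if the CPU doesn't support SSSE3 but supports SSE3 → go for lddqu (and note the story about memory types)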
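If you really do care about those old CPUs, the rule of thumb above could be turned into a one-time runtime choice, roughly as sketched here. The names load16_fn, load16_movdqu, load16_lddqu and pick_load16 are invented for this example, and __builtin_cpu_init / __builtin_cpu_supports are GCC/Clang builtins, so this assumes one of those compilers on x86.

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128 */
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 */
#include <stdint.h>
#include <string.h>

/* Hypothetical function-pointer type for a 16-byte unaligned load. */
typedef __m128i (*load16_fn)(const void *);

static __m128i load16_movdqu(const void *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

__attribute__((target("sse3")))
static __m128i load16_lddqu(const void *p)
{
    return _mm_lddqu_si128((const __m128i *)p);
}

/* Pick lddqu only on CPUs with SSE3 but without SSSE3 (P4 / Core1 era).
 * Everywhere else the two behave the same, so plain movdqu is used. */
load16_fn pick_load16(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse3") && !__builtin_cpu_supports("ssse3"))
        return load16_lddqu;
    return load16_movdqu;
}
```

The memory-type caveat still applies to the lddqu path, so this choice only makes sense for ordinary write-back memory, and calling through a function pointer for every load is likely to cost more than it saves. In practice you would select a whole loop variant once instead.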
