I know that access to main memory has a high latency if data is not cached. This question is about bandwidth.
How much computation per byte does a function need before it is guaranteed never to be bound by main-memory bandwidth on a regular desktop PC?
I have read that modern RAM has a bandwidth of 25-30 GB/s (DDR3, dual-channel mode). As far as I can tell, one core of a modern Intel processor can store at most 32 bytes per instruction using modern SIMD instruction sets, and it can execute at most about 4 * 10^9 instructions per second. So effectively it can emit about 120 GB/s. On a processor with 8 hardware threads, the maximum output would be around 960 GB/s as an upper-bound estimate.
So the processor can produce roughly 36 times more data than RAM can absorb. Is it safe to assume that any function spending more than 36 cycles on non-load/store operations per SIMD load or store (or more than 9 cycles per regular 8-byte load or store) will never be bound by main-memory bandwidth? Can this estimate be lowered significantly, or is it already too low for some reason?
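The arithmetic above can be written out as a back-of-envelope sketch. All numbers here are the assumed figures from the question (midpoint RAM bandwidth, 32-byte AVX stores, ~4 GHz issue rate, 8 threads), not measurements:

```cpp
#include <cassert>

// Ratio of peak per-chip SIMD store bandwidth to main-memory bandwidth,
// using the question's assumed numbers.
double store_to_ram_ratio() {
    const double ram_bw       = 27e9;   // ~25-30 GB/s DDR3 dual channel (midpoint)
    const double bytes_per_op = 32.0;   // one 256-bit SIMD store per instruction
    const double ops_per_sec  = 4e9;    // ~4e9 instructions/s per core (assumed)
    const double core_bw = bytes_per_op * ops_per_sec;  // ~128 GB/s per core
    const double chip_bw = core_bw * 8.0;               // 8 hardware threads
    return chip_bw / ram_bw;            // roughly the ~36x figure above
}
```

With these inputs the ratio comes out near 38; the question's ~36 just reflects slightly different rounding of the same quantities.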
Given that I have:
X = (x_1, x_2, ..., x_n)
I am looking for rules of thumb for when it is better (or at least no worse) to implement
D(X)
versus
C(A(X), B(X))
given that the first implementation puts more pressure on caches and registers, while the second implementation performs more load/store operations.
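To make the trade-off concrete, here is a minimal sketch with a made-up example function (the choice D(X)_i = sin(x_i) + cos(x_i), and the names fused/split, are illustrative assumptions, not from the question). The fused version computes D(X) in one pass with intermediates in registers; the split version materializes A(X) and B(X) into temporary arrays and then combines them with C, trading register pressure for extra loads and stores:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fused: D(X) computed in a single pass; intermediates stay in registers.
std::vector<double> fused(const std::vector<double>& x) {
    std::vector<double> d(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        d[i] = std::sin(x[i]) + std::cos(x[i]);   // D(X) directly
    return d;
}

// Split: A(X) and B(X) written to temporaries, then combined by C.
// Three passes over n-sized arrays -> more memory traffic, but each
// loop body is simpler and needs fewer registers.
std::vector<double> split(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> a(n), b(n), d(n);
    for (std::size_t i = 0; i < n; ++i) a[i] = std::sin(x[i]);  // A(X)
    for (std::size_t i = 0; i < n; ++i) b[i] = std::cos(x[i]);  // B(X)
    for (std::size_t i = 0; i < n; ++i) d[i] = a[i] + b[i];     // C(A(X), B(X))
    return d;
}
```

Both produce the same result; the question is which is faster once n exceeds the cache size and the split version's extra passes start hitting main memory.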
(Of course, you can tell me to benchmark; I am fine with that. But sometimes I just want to make an educated guess and only revisit the code when it becomes a bottleneck later.)
c ++ performance c assembly x86
Markus Mayr