What is the required complexity of a function so that it is not bound by main memory?

I know that access to main memory has a high latency if data is not cached. This question is about bandwidth.

What is the required complexity of a function so that it is never bound by main memory bandwidth on a regular desktop PC?

I have read that modern RAM has a bandwidth of 25-30 GB/s (DDR3, dual-channel mode). As far as I can tell, one core of a modern Intel processor can store at most 32 bytes per instruction using modern SIMD instruction sets, and it can execute at most 4 * 10^9 instructions per second. So effectively it can output about 120 GB/s. Given a processor with 8 threads, the maximum output would be around 960 GB/s as a worst-case estimate.

The processor can therefore output at most ~36 times as much data as can be written to RAM. Is it safe to assume that any function that performs more than 36 cycles of non-load/store work per SIMD load or store (or more than 9 cycles per regular 8-byte load or store) will never be bound by main memory? Can this estimate be lowered significantly, or is it too low for some reason?
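The arithmetic behind that estimate can be written out as a small sketch. Everything here is an assumption taken from the rough figures above, not a measurement: the struct `MachineEstimate` and both function names are hypothetical, and the model treats every instruction as a store that retires in one cycle. Note that 32 bytes * 4 GHz is 128 GB/s per thread, slightly above the rounded "about 120", so the unrounded ratio against 25-30 GB/s comes out at roughly 34 to 41, in the same ballpark as ~36.

```cpp
#include <cassert>

// Rough machine model; all fields are estimates from the question.
struct MachineEstimate {
    double ram_bandwidth_gbs;  // main-memory bandwidth in GB/s
    double bytes_per_store;    // bytes written by one store instruction
    double instr_per_ns;       // instructions per nanosecond per thread (4.0 = 4 GHz at 1 per cycle)
    int    threads;            // hardware threads issuing stores at once
};

// Upper bound on the data all threads together could emit, in GB/s.
inline double peak_output_gbs(const MachineEstimate& m) {
    return m.bytes_per_store * m.instr_per_ns * m.threads;
}

// Cycles each thread must spend per store so that the combined
// traffic does not exceed the RAM bandwidth in this simple model.
inline double cycles_per_store_needed(const MachineEstimate& m) {
    assert(m.ram_bandwidth_gbs > 0.0);
    return peak_output_gbs(m) / m.ram_bandwidth_gbs;
}
```

For example, `cycles_per_store_needed({30.0, 32.0, 4.0, 8})` models 32-byte SIMD stores against 30 GB/s, and `cycles_per_store_needed({25.0, 8.0, 4.0, 8})` models 8-byte stores against 25 GB/s.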

Given that I have:

 X = (x_1, x_2, ..., x_n)                    // dataset, large enough to make good use of caches
 a(x), b(x), c(x, y), d(x) := c(a(x), b(x))  // functions that operate on elements
 A(X) := (a(x_1), a(x_2), ..., a(x_n))       // functions that operate on data sets

I am looking for guidelines on when it is better (or at least not worse) to implement

 D(X) 

rather than

 C(A(X), B(X)) 

given that the first implementation puts more pressure on caches and registers, and the second implementation has more load / store operations.
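To make the two variants concrete, here is a minimal sketch. The bodies of `a`, `b` and `c` are hypothetical placeholders (the question does not define them), and `float` elements in a `std::vector` are an assumption. The traversal counts in the comments ignore caching and write-allocate traffic, so they are only a first approximation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder element-wise functions; the real a, b, c are not given.
inline float a(float x) { return x * 2.0f; }
inline float b(float x) { return x + 1.0f; }
inline float c(float x, float y) { return x * y; }

// A(X): reads X once, writes one temporary array.
std::vector<float> A(const std::vector<float>& X) {
    std::vector<float> out(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) out[i] = a(X[i]);
    return out;
}

// B(X): reads X again, writes a second temporary array.
std::vector<float> B(const std::vector<float>& X) {
    std::vector<float> out(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) out[i] = b(X[i]);
    return out;
}

// C(U, V): reads both temporaries, writes the result.
// C(A(X), B(X)) therefore touches 7 arrays' worth of data in total.
std::vector<float> C(const std::vector<float>& U, const std::vector<float>& V) {
    assert(U.size() == V.size());
    std::vector<float> out(U.size());
    for (std::size_t i = 0; i < U.size(); ++i) out[i] = c(U[i], V[i]);
    return out;
}

// D(X): one fused pass, reads X once and writes the result once
// (2 arrays' worth), but keeps a, b and c live in the same loop body.
std::vector<float> D(const std::vector<float>& X) {
    std::vector<float> out(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) out[i] = c(a(X[i]), b(X[i]));
    return out;
}
```

Both `C(A(X), B(X))` and `D(X)` compute the same values; the difference is only in how much data moves through memory and how much state the inner loop has to hold at once.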

(Of course, you can tell me to benchmark, and I am fine with that, but sometimes I just want to make an educated guess and only revisit the matter when it becomes a problem or a bottleneck later.)

c++ performance c assembly x86
1 answer

I think it very much depends on whether the code is written in such a way that the CPU can prefetch the next data item into the cache. If it prefetches the wrong data, you will still be memory bound no matter how much time you spend processing the current data.
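As an illustration of the difference, here is a small sketch with two hypothetical functions that compute the same sum. In the first, the next address is predictable from the loop itself; in the second, each address depends on a value loaded from `order`, so if `order` is a random permutation over a large array, a stride-based hardware prefetcher will generally not be able to guess it. How large the gap is on a given machine is something only a measurement can tell.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential walk: consecutive addresses, friendly to hardware prefetching.
std::uint64_t sum_sequential(const std::vector<std::uint32_t>& data) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) sum += data[i];
    return sum;
}

// Indirect walk: the address of data[idx] is only known once order[i]
// has been loaded, so the access pattern is as irregular as 'order' is.
std::uint64_t sum_indirect(const std::vector<std::uint32_t>& data,
                           const std::vector<std::size_t>& order) {
    std::uint64_t sum = 0;
    for (std::size_t idx : order) {
        assert(idx < data.size());
        sum += data[idx];
    }
    return sum;
}
```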

And if you have several threads writing to the same address (their copies of the data will be in separate cache lines), then even if the data was prefetched, once another thread has written to that address the line has to be flushed and read again from main memory.

In general, I don't think you can reason about these things at this level of generality; it will depend on what you actually have.
