I have an application that streams through 250 MB of data, applying a simple and fast neural-net threshold function to each data chunk (just 2 32-bit words). Based on the result of that (very simple) computation, the chunk is unpredictably pushed into one of 64 bins. So it's one big stream in and 64 shorter (variable-length) streams out.
This is repeated many times with different detection functions.
The computation is bandwidth-limited. I can tell because there is no change in speed even if I use a discriminant function that is much more computationally intensive.
What is the best way to structure the writes to the new streams to optimize my memory bandwidth? I particularly suspect that understanding cache usage and cache-line size may play a big role here. Imagine the worst case, where many of my 64 output streams unluckily map to the same cache line. Then, each time I write the next 64 bits of data to a stream, the CPU has to flush a stale cache line out to main memory and load the proper cache line in. Each of those moves uses 64 bytes of bandwidth... so my bandwidth-limited application may be wasting something like 95% of its memory bandwidth (in this hypothetical worst case, anyway).
It is hard to even try to measure this effect, so devising ways around it is even more nebulous. Or am I chasing a ghost bottleneck that the hardware somehow optimizes away better than I could?
I'm using Core 2 x86 processors, if that matters.
Edit: Here is some sample code. It streams through an array and copies its elements to various output arrays, chosen pseudo-randomly. Running the same program with different numbers of destination bins gives different run times, even though the same amount of computation and the same number of memory reads and writes were done:
2 output streams: 13 seconds
8 output streams: 13 seconds
32 output streams: 19 seconds
128 output streams: 29 seconds
512 output streams: 47 seconds
That's a 4X slowdown going from 2 output streams to 512.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    const int size = 1 << 19;
    int streambits = 3;
    int streamcount = 1UL << streambits;
    int *instore = (int *)malloc(size * sizeof(int));
    int **outstore = (int **)malloc(streamcount * sizeof(int *));
    int **out = (int **)malloc(streamcount * sizeof(int *)); /* was sizeof(int): too small */
    unsigned int seed = 0;

    for (int j = 0; j < size; j++) instore[j] = j;
    for (int i = 0; i < streamcount; ++i)
        outstore[i] = (int *)malloc(size * sizeof(int));

    time_t startTime = time(NULL);
    for (int k = 0; k < 10000; k++) {
        /* reset the write cursor of every bin to the start of its array */
        for (int i = 0; i < streamcount; i++) out[i] = outstore[i];
        int *in = instore;
        for (int j = 0; j < size / 2; j++) {
            /* cheap LCG picks a pseudo-random destination bin */
            seed = seed * 0x1234567 + 0x7162521;
            int bin = seed >> (32 - streambits);
            /* copy one two-word chunk into the chosen bin */
            *(out[bin]++) = *(in++);
            *(out[bin]++) = *(in++);
        }
    }
    time_t endTime = time(NULL);
    printf("Eval time=%ld\n", (long)(endTime - startTime));
}