What is the best way to demonstrate the effect of affinity settings?

I once noticed that Windows does not keep computation-intensive threads on a specific core; it keeps switching cores instead. So I assumed that the work would be done faster if the thread kept access to the same data caches. Indeed, I was able to observe a stable speed improvement of about 1% after setting the thread's affinity mask to a single core (in a ppmd (de)compression thread). But then I tried to build a simple demo of this effect and more or less failed. That is, it works as expected on my system (Q9450):

    buflog=21 bufsize=2097152
    (cache flush) first run = 6.938s
    time with default affinity = 6.782s
    time with first core only = 6.578s
    speed gain is 3.01%

but the people I asked could not reproduce the effect. Any suggestions?

    #include <stdio.h>
    #include <stdlib.h>   // atoi
    #include <windows.h>

    int buflog=21, bufsize, bufmask;
    char* a;
    char* b;
    volatile int r = 0;

    // Does 10^9 iterations of pseudo-random reads; returns elapsed milliseconds.
    __declspec(noinline) int benchmark( char* a ) {
      int t0 = GetTickCount();
      int i,h=1,s=0;
      for( i=0; i<1000000000; i++ ) {
        h = h*200002979 + 1;
        s += ((int&)a[h&bufmask]) + ((int&)a[h&(bufmask>>2)]) + ((int&)a[h&(bufmask>>4)]);
      }
      r = s; // volatile store keeps the loop from being optimized away
      t0 = GetTickCount() - t0;
      return t0;
    }

    // Background load pinned to the second core (mask 2).
    DWORD WINAPI loadcore( LPVOID ) {
      SetThreadAffinityMask( GetCurrentThread(), 2 );
      while(1) benchmark(b);
    }

    int main( int argc, char** argv ) {
      if( (argc>1) && (atoi(argv[1])>16) ) buflog=atoi(argv[1]);
      bufsize=1<<buflog; bufmask=bufsize-1;
      a = new char[bufsize+4];
      b = new char[bufsize+4];
      printf( "buflog=%i bufsize=%i\n", buflog, bufsize );

      CreateThread( 0, 0, &loadcore, 0, 0, 0 );

      printf( "(cache flush) first run = %.3fs\n", float(benchmark(a))/1000 );

      float t1 = benchmark(a); t1/=1000;
      printf( "time with default affinity = %.3fs\n", t1 );

      // Pin the main thread to the first core (mask 1) and measure again.
      SetThreadAffinityMask( GetCurrentThread(), 1 );
      float t2 = benchmark(a); t2/=1000;
      printf( "time with first core only = %.3fs\n", t2 );

      printf( "speed gain is %4.2f%%\n", (t1-t2)*100/t1 );
      return 0;
    }
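To make the access pattern explicit: each iteration reads three ints at pseudo-random offsets, and the three masks limit those offsets to the first bufsize, bufsize/4 and bufsize/16 bytes of the buffer. The helper below is a hypothetical sketch (it is not part of the program) that computes those region sizes from `buflog`.

```cpp
#include <cstddef>

// Size in bytes of the region whose start offsets are selected by
// `h & (bufmask >> shift)`, with bufmask = (1 << buflog) - 1 as in the
// benchmark. Assumes 0 <= shift <= buflog and buflog below the bit width
// of size_t.
std::size_t working_set_bytes(int buflog, int shift) {
    std::size_t bufmask = (std::size_t(1) << buflog) - 1;
    return (bufmask >> shift) + 1;
}
```

With the default `buflog=21` this gives 2097152, 524288 and 131072 bytes (2 MiB, 512 KiB and 128 KiB) for shifts 0, 2 and 4.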

PS I can post a link to a compiled version if someone needs it.

+4
4 answers

Default affinity: http://nishi.dreamhosters.com/u/paf_a0.png

Affinity for core #4: http://nishi.dreamhosters.com/u/paf_a8.png

That is an archiver running. Do you really think it is good for the worker thread to wander all over the CPU like that?
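For readers comparing the two settings: the mask passed to `SetThreadAffinityMask` is a bitmask in which bit n selects logical processor n, counted from zero. The sketch below uses hypothetical helper names and plain integers instead of the Windows types. It assumes "core #4" above is counted from one, which would correspond to mask 8.

```cpp
#include <cstdint>

// Mask with only the bit for the given zero-based core set.
// core must be below 64.
std::uint64_t mask_for_core(unsigned core) {
    return std::uint64_t(1) << core;
}

// True if the mask allows the thread to run on the given zero-based core.
// core must be below 64.
bool mask_allows(std::uint64_t mask, unsigned core) {
    return ((mask >> core) & 1u) != 0;
}
```

In these terms the question's program uses `mask_for_core(0)` (value 1) for the measured thread and `mask_for_core(1)` (value 2) for the background load thread.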

+3

Perhaps you were just lucky, and on the other computers where the program was tested something else was bound to the same core as your thread, even if that other thread sleeps most of the time.

It would still interrupt your program from time to time, whenever the other thread gets scheduled.

0
  • Windows does not deliberately move processes between processors. If that happened to you, you were just unlucky.
  • You may get small speed gains if you get a lot of cache hits; it depends on your application. (Unless you have big hardware with a scary NUMA memory architecture, which can cause all kinds of dependencies.)
  • In your case, why not just raise the priority of the process so that it never gets swapped off the CPU?
0

How do you know that the load on the other 3 cores comes from your thread and not from some system threads? For example, if the machine is swapping or something like that. Set up some performance counters for your process in perfmon and verify this assumption.

0

Source: https://habr.com/ru/post/1316144/