OpenMP: don't use hyperthreads (halve `num_threads()` with hyperthreading)

In the earlier question "Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU", I determined that my program's run time varies from 11 s to 13 s (almost always above 12 s, and sometimes as slow as 13.4 s) at about 500% CPU when using the default #pragma omp parallel for, and that the OpenMP speedup is only 2.5x at 5x CPU with g++-4.7 -O3 -fopenmp on a 4-core, 8-thread Xeon.

I tried schedule(static) num_threads(4) and noticed that my program then always finishes in 11.5 s to 11.7 s (always below 12 s) at about 320% CPU. In other words, it runs more consistently and uses fewer resources, even if its best run is half a second slower than the rare best-case outlier with hyperthreading.
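For reference, the two clauses attach to the loop pragma roughly as follows. This is only a minimal sketch: the function `scale` and its loop body are hypothetical stand-ins, not the program that was timed, and the hard-coded 4 simply matches the physical core count of the Xeon in question.

```c
#include <stddef.h>

/* Hypothetical loop body: multiply every element of a[] by k.
 * schedule(static) splits the iterations into equal contiguous chunks;
 * num_threads(4) caps the team at 4 threads no matter how many logical
 * processors the machine reports. Without -fopenmp the pragma is
 * ignored and the loop runs serially with the same result. */
void scale(double *a, size_t n, double k) {
    #pragma omp parallel for schedule(static) num_threads(4)
    for (long i = 0; i < (long)n; i++) {
        a[i] *= k;
    }
}
```

Hard-coding the count is exactly what the question wants to avoid, which is why a way to discover the number of physical cores is needed.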

Is there any simple OpenMP way to detect hyperthreading and reduce num_threads() to the actual number of CPU cores?

(There is a similar question, "Poor performance due to hyper-threading with OpenMP: how to bind threads to cores", but in my testing I found that simply reducing from 8 to 4 threads somehow already does the job with g++-4.7 on Debian 7 wheezy and a Xeon E3-1240v3, so this question is only about reducing num_threads() to the number of cores.)

2 answers

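On Linux [assuming x86], you can read /proc/cpuinfo. Look at the cpu cores and siblings fields: the first is the number of physical cores, the second is the number of logical processors. (For example, 4 and 8.)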
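The comparison of the cpu cores and siblings fields can be done programmatically. The sketch below is an illustration under stated assumptions, not part of the original answer: the helper names `cpuinfo_field` and `smt_enabled` are made up, the stream is expected to look like Linux's /proc/cpuinfo on x86, and only the first package's values are read (so a multi-socket machine would need the per-package core counts summed).

```c
#include <stdio.h>
#include <string.h>

/* Return the integer value of the first "<key> : <n>" line in a
 * /proc/cpuinfo-style stream, or -1 if the key is not found. */
static int cpuinfo_field(FILE *f, const char *key) {
    char line[256];
    size_t klen = strlen(key);
    int at_line_start = 1;   /* long lines (e.g. "flags") arrive in pieces */
    rewind(f);
    while (fgets(line, sizeof line, f)) {
        int starts_line = at_line_start;
        at_line_start = (strchr(line, '\n') != NULL);
        if (!starts_line || strncmp(line, key, klen) != 0)
            continue;
        /* Require only blanks between the key and the colon, so that
         * "cpu" does not match "cpu cores". */
        const char *p = line + klen;
        while (*p == ' ' || *p == '\t') p++;
        int value;
        if (*p == ':' && sscanf(p + 1, "%d", &value) == 1)
            return value;
    }
    return -1;
}

/* 1 if a package has more logical processors than cores (SMT active),
 * 0 if not, -1 if either field is missing. */
static int smt_enabled(FILE *f) {
    int cores = cpuinfo_field(f, "cpu cores");
    int siblings = cpuinfo_field(f, "siblings");
    if (cores <= 0 || siblings <= 0) return -1;
    return siblings > cores;
}
```

A caller would open the file with `fopen("/proc/cpuinfo", "r")`, pass the stream to `cpuinfo_field(f, "cpu cores")`, and use the result as the thread count on a single-socket machine.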

Without relying on Linux [as Zulan notes], on x86 you can use cpuid.

Either way, you can then set OMP_NUM_THREADS, for example from a launcher/wrapper script.
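A wrapper does not have to be a shell script. As a rough sketch of the same idea in C, the hypothetical helpers below (`export_omp_num_threads` and `launch_with_threads` are names invented for this example) set OMP_NUM_THREADS and then exec the real program, which inherits the variable. POSIX `setenv` and `execvp` are assumed to be available.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Export OMP_NUM_THREADS=<n> into this process's environment.
 * Returns 0 on success, -1 on failure or if n is not positive. */
int export_omp_num_threads(int n) {
    char buf[16];
    if (n <= 0) return -1;
    snprintf(buf, sizeof buf, "%d", n);
    return setenv("OMP_NUM_THREADS", buf, 1);  /* 1 = overwrite existing */
}

/* Set the thread count, then replace this process with the target
 * program; argv[0] names the program to run and argv must end with
 * a NULL pointer. Returns -1 only if something failed. */
int launch_with_threads(int n, char *const argv[]) {
    if (export_omp_num_threads(n) != 0) return -1;
    execvp(argv[0], argv);
    return -1;  /* reached only when execvp() fails */
}
```

The OpenMP runtime of the launched program reads the variable at startup, so the program itself needs no changes.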

See also, on CAS, these CppCon 2015 talks: https://www.youtube.com/watch?v=lVBvHbJsg5Y https://www.youtube.com/watch?v=1obZeHnAwz4


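Hyper-Threading is Intel's implementation of simultaneous multithreading (SMT). Current AMD processors do not implement SMT (the Bulldozer family uses a different scheme, which AMD calls cluster-based multithreading, while the Zen microarchitecture is expected to have SMT). OpenMP has no built-in support for detecting SMT.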

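A general function for detecting Hyper-Threading has to support several processor generations and has to check that the processor is Intel rather than AMD.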

With OpenMP, however, you can write a function that works on many modern Intel processors.

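The following code counts the physical cores on a modern Intel processor. The threads must be bound to logical processors for this to work. With GCC you can use export OMP_PROC_BIND=true (otherwise bind the threads in code).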

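Note that this method may not be reliable under VirtualBox. With VirtualBox on a 4-core/8-thread CPU, a Windows host, and a Linux guest VM configured with 4 cores, this code reports 2 cores, and /proc/cpuinfo indicates that two of the four are in fact logical processors.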

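#include <stdio.h>
#ifdef _MSC_VER
#include <intrin.h>   // declares __cpuidex for the Microsoft-compatible branch below
#endif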

//cpuid function defined in instrset_detect.cpp by Agner Fog (2014 GNU General Public License)
//http://www.agner.org/optimize/vectorclass.zip

// Define interface to cpuid instruction.
// input:  eax = functionnumber, ecx = 0
// output: eax = output[0], ebx = output[1], ecx = output[2], edx = output[3]
static inline void cpuid (int output[4], int functionnumber) {
#if defined (_MSC_VER) || defined (__INTEL_COMPILER)       // Microsoft or Intel compiler, intrin.h included

  __cpuidex(output, functionnumber, 0);                  // intrinsic function for CPUID

#elif defined(__GNUC__) || defined(__clang__)              // use inline assembly, Gnu/AT&T syntax

  int a, b, c, d;
  __asm("cpuid" : "=a"(a),"=b"(b),"=c"(c),"=d"(d) : "a"(functionnumber),"c"(0) : );
  output[0] = a;
  output[1] = b;
  output[2] = c;
  output[3] = d;

#else                                                      // unknown platform. try inline assembly with masm/intel syntax

  __asm {
    mov eax, functionnumber
    xor ecx, ecx
    cpuid
    mov esi, output
    mov [esi],    eax
    mov [esi+4],  ebx
    mov [esi+8],  ecx
    mov [esi+12], edx
  }

#endif
}

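int getNumCores(void) {
  // Assumes an Intel processor with CPUID leaf 11 (x2APIC topology), at most
  // two hardware threads per core, and OpenMP threads bound to distinct
  // logical processors (e.g. OMP_PROC_BIND=true). Compile with -fopenmp.
  int cores = 0;
  #pragma omp parallel reduction(+:cores)
  {
    int regs[4];
    cpuid(regs,11);
    // EDX (regs[3]) is the x2APIC ID of the logical processor running this
    // thread; on such processors bit 0 is clear for the first hardware
    // thread of each core, so count only those threads.
    if(!(regs[3]&1)) cores++;
  }
  return cores;
}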

int main(void) {
  printf("cores %d\n", getNumCores());
}
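
When OMP_PROC_BIND is not available, the binding can be done in code. The following is one possible Linux-only sketch using `sched_setaffinity`; the helper name `pin_to_cpu` is an assumption made for this example and is not the binding code the answer's author uses.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for sched_setaffinity() and the CPU_* macros */
#endif
#include <sched.h>

/* Restrict the calling thread to a single logical CPU (Linux only).
 * Returns 0 on success, -1 on error or if cpu is out of range. */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE) return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}
```

Inside the parallel region, each thread could call `pin_to_cpu(omp_get_thread_num())` before executing cpuid. That assumes the logical processors are numbered 0 to n-1 and that all of them are available to the process, which does not hold under a restricted cpuset.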