Will 8 logical threads on 4 cores run 4 times faster in parallel?

I am benchmarking software that runs 4 times faster than my serial version on an Intel i7-2670QM when using all 8 of my "logical" threads. I would like some community feedback on my interpretation of the benchmarking results.

When I use 4 threads on the 4 cores, I get a 4x speedup; the whole algorithm runs in parallel. This seems logical to me, and it is what Amdahl's law predicts. Windows Task Manager tells me that I am using 50% of the CPU.

However, when I run the same software on all 8 threads, I again get a 4x speedup, not 8x.

If I understand correctly: my processor has 4 cores running at 2.2 GHz each, but does the effective frequency drop to 1.1 GHz per thread when spread over 8 "logical" threads, and likewise for shared components such as the cache? And if so, why does Task Manager claim that only 50% of my processor is in use?

```cpp
#define NumberOfFiles 8
...
char startLetter = 'a';
#pragma omp parallel for shared(startLetter)
for (int f = 0; f < NumberOfFiles; f++) {
    ...
}
```

I do not include disk I/O time in the measurement; I am only interested in the time the STL call (the STL sort) takes, not the disk I/O.

+7
5 answers

The i7-2670QM has 4 cores, but it can run 8 threads in parallel: there are only 4 execution cores, yet the hardware keeps 8 thread contexts resident at once. At most 4 threads are actually executing at any instant; when one of them stalls, for example on a memory access, another thread can begin executing on the freed core with very little switching penalty. Read more about Hyper-Threading. In reality there are only certain scenarios where Hyper-Threading gives a big performance boost, and more modern processors handle hyper-threads better than older ones.

Your benchmark showed that your code is CPU-bound, i.e. there were few pipeline stalls that would give Hyper-Threading an edge. 50% CPU is correct: 4 cores are working and the 4 extra logical processors are doing nothing. Switch Hyper-Threading off in the BIOS and you will see 100% CPU.

+11

Here is a short description of Hyper-Threading.

Thread switching is slow: you have to stop execution, copy a bunch of register values out to memory, copy a bunch of values from memory back into the CPU, and then start over with the new thread.

This is where the 4 "virtual" cores come in. You have 4 physical cores, but hyperthreading lets the processor hold 2 threads on each core.

Only one of the two can run at a time; however, when one thread has to stop for a memory access, a disk access, or anything else that takes a while, the core can switch to the other thread and run it for a bit. On older processors, the core essentially just slept during that time.

So your quad core has 4 cores that can each execute 1 thread at a time, but can hold a 2nd thread on standby, ready to run as soon as the first has to wait on some other part of the computer.

If your workload mixes heavy memory use with heavy CPU use, you should see a modest decrease in overall execution time; but if it is almost entirely CPU-bound, you are better off sticking to just 4 threads.

+6

The important piece of information here is the distinction between physical and logical threads.
If your processor has 4 physical cores, you have the physical resources to execute 4 separate threads of execution in parallel. So, if there are no data conflicts between your threads, you can usually measure a 4x performance increase compared to single-thread speed.
I am also assuming that the OS (or you :)) sets thread affinity correctly, so each thread runs on its own physical core.
When you enable HT (Hyper-Threading) on your CPU, the core frequency does not change. :)
What happens is that part of the hardware pipeline (inside the core and around it: uncore, caches, etc.) is duplicated, but part of it is still shared between the logical threads. That is the reason you are not measuring an 8x performance increase. In my experience, using all the logical cores can buy you a performance improvement of about 1.5x - 1.7x per physical core, depending on the code you are running, cache usage (remember that the L1 cache is shared between the two logical cores of one physical core, for example), thread affinity, etc. Hope this helps.

+5

Some real numbers:

A CPU-intensive task on my i7 (adding the numbers from 1 to 1000000000 into an int variable, 16 times), averaged over 8 runs:

Summary (threads / ticks):

 Threads | Ticks
       1 | 26414
       4 |  8923
       8 |  6659
      12 |  6592
      16 |  6719
      64 |  6811
     128 |  6778

Please note that the "using X threads" line in the reports below shows one more than the number of threads available for the tasks: one extra thread just dispatches the tasks and waits for the workers to finish; it does no CPU-heavy work and uses essentially no CPU.

 8 tests, 16 tasks, counting to 1000000000, using 2 threads:
 Ticks: 26286 Ticks: 26380 Ticks: 26317 Ticks: 26474 Ticks: 26442 Ticks: 26426 Ticks: 26474 Ticks: 26520
 Average: 26414 ms

 8 tests, 16 tasks, counting to 1000000000, using 5 threads:
 Ticks: 8799 Ticks: 9157 Ticks: 8829 Ticks: 9002 Ticks: 9173 Ticks: 8720 Ticks: 8830 Ticks: 8876
 Average: 8923 ms

 8 tests, 16 tasks, counting to 1000000000, using 9 threads:
 Ticks: 6615 Ticks: 6583 Ticks: 6630 Ticks: 6599 Ticks: 6521 Ticks: 6895 Ticks: 6848 Ticks: 6583
 Average: 6659 ms

 8 tests, 16 tasks, counting to 1000000000, using 13 threads:
 Ticks: 6661 Ticks: 6599 Ticks: 6552 Ticks: 6630 Ticks: 6583 Ticks: 6583 Ticks: 6568 Ticks: 6567
 Average: 6592 ms

 8 tests, 16 tasks, counting to 1000000000, using 17 threads:
 Ticks: 6739 Ticks: 6864 Ticks: 6599 Ticks: 6693 Ticks: 6676 Ticks: 6864 Ticks: 6646 Ticks: 6677
 Average: 6719 ms

 8 tests, 16 tasks, counting to 1000000000, using 65 threads:
 Ticks: 7223 Ticks: 6552 Ticks: 6879 Ticks: 6677 Ticks: 6833 Ticks: 6786 Ticks: 6739 Ticks: 6802
 Average: 6811 ms

 8 tests, 16 tasks, counting to 1000000000, using 129 threads:
 Ticks: 6771 Ticks: 6677 Ticks: 6755 Ticks: 6692 Ticks: 6864 Ticks: 6817 Ticks: 6849 Ticks: 6801
 Average: 6778 ms
+1

In most BIOSes, HT is listed as SMT (Simultaneous Multithreading) or HTT (Hyper-Threading Technology). The effectiveness of HT depends on the so-called compute-to-fetch ratio: the number of in-core operations (on registers/caches) that your code performs between loads or stores that go to slow main memory or memory-mapped I/O. For highly cache-friendly, CPU-bound code, HT gives practically no noticeable performance increase. For more memory-bound code, HT can really benefit execution because of so-called "latency hiding". That is why most non-x86 server processors provide from 4 (e.g. IBM POWER7) to 8 (e.g. UltraSPARC T4) hardware threads per core. These processors are commonly used in database and transaction-processing systems, where many concurrent memory-bound requests are serviced simultaneously.

By the way, Amdahl's law states that the upper limit of parallel speedup is one over the serial fraction of the code. Typically the serial fraction grows with the number of processing elements if there is communication or other synchronization between the threads (possibly hidden inside the runtime), although sometimes cache effects can lead to super-linear speedup, and sometimes cache thrashing can significantly reduce performance.

+1
