I tried using OpenMP with a single #pragma omp parallel for, and my program went from a 35 s runtime (99.6% CPU) to 14 s (500% CPU), running on an Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz. That is the difference between compiling with g++ -O3 and g++ -O3 -fopenmp, both with gcc (Debian 4.7.2-5) 4.7.2 on Debian 7 (wheezy).
Why does it only use 500% CPU at most, when the theoretical maximum is 800%, since the CPU has 4 cores / 8 threads? Shouldn't it at least reach the low 700s?
Why am I only getting a 2.5x improvement in overall time, yet at a cost of 5x in CPU use? Is it cache thrashing?
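To put the numbers above side by side, here is the arithmetic worked out. It assumes the reported CPU percentages are averages over the whole run, which is how time(1) style output is normally read but is not stated explicitly.

```latex
\begin{align*}
\text{CPU time (serial)}   &\approx 35\,\text{s} \times 0.996 \approx 34.9\ \text{CPU-seconds} \\
\text{CPU time (OpenMP)}   &\approx 14\,\text{s} \times 5.00 = 70\ \text{CPU-seconds} \\
\text{Wall-clock speedup}  &= 35 / 14 = 2.5 \\
\text{CPU time ratio}      &\approx 70 / 34.9 \approx 2.0
\end{align*}
```

So under that assumption the parallel build spends roughly twice as many CPU-seconds in total to finish 2.5x sooner, on about 5 of the 8 hardware threads.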
The whole program is based on C++ string manipulation with recursive processing (using lots of .substr(1) and some concatenation), where said strings are continuously inserted into a vector of set.
In other words, there are about 2k loop iterations in a single parallel for loop operating on the vector. Each iteration may make two recursive calls to itself with some string .substr(1) and + char, and the recursion terminates with a set .insert of either a single string or a concatenation of two strings. That set .insert also takes care of the significant number of duplicates that are possible.
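For reference, here is a minimal sketch of the shape of the code described above. It is not the actual program: the names expand and process, the per-input std::set, and the exact recursion (two calls on s.substr(1), one of them appending the dropped character) are assumptions chosen only to mirror the description.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical recursive worker (the real recursion is not shown here).
// Each call recurses twice on s.substr(1): once dropping s[0], once
// appending it to the accumulated prefix.
static void expand(const std::string& s, const std::string& prefix,
                   std::set<std::string>& out) {
    if (s.empty()) {
        out.insert(prefix);  // std::set::insert ignores duplicates
        return;
    }
    const std::string rest = s.substr(1);  // creates a new string object
    expand(rest, prefix, out);             // drop s[0]
    expand(rest, prefix + s[0], out);      // keep s[0] (string + char)
}

// Assumption: one set per input, so no two iterations write to the
// same container and no locking is needed inside the loop.
std::vector<std::set<std::string> > process(
        const std::vector<std::string>& inputs) {
    std::vector<std::set<std::string> > results(inputs.size());
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(inputs.size()); ++i) {
        expand(inputs[i], "", results[i]);
    }
    return results;
}
```

In this sketch, every .substr(1), every concatenation, and every new set node creates a fresh object, so each iteration performs many small allocations. That is the part of the workload this question is asking about.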
Everything works correctly and well inside the spec, but I'm trying to check if it can work faster. :-)