Why is this OpenMP program slower than single-threaded?

Take a look at this code.

Single-threaded program: http://pastebin.com/KAx4RmSJ . Compiled with

g++ -lrt -O2 main.cpp -o nnlv2

Multi-threaded with OpenMP: http://pastebin.com/fbe4gZSn Compiled with

g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp

I tested it on a dual-core system (so there are two threads running in parallel). But the multi-threaded version is slower than the single-threaded one (and its timing is unstable; try running it several times). What's wrong? Where did I make a mistake?

Some tests:

Single-threaded:

Layers  Neurons  Inputs  ---  Time (ns)
10      200      200     ---  1898983
10      500      500     ---  11009094
10      1000     1000    ---  48116913

Multithreaded:

Layers  Neurons  Inputs  ---  Time (ns)
10      200      200     ---  2518262
10      500      500     ---  13861504
10      1000     1000    ---  53446849

I do not understand what is wrong.

+4
4 answers

Is your goal here to learn OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write code that uses multiply-add, reduce the number of passes, and enable SIMD.

Step 1: Combine the loops and use multiply-add:

    // remove the variable 'temp' completely
    for(int i=0;i<LAYERS;i++) {
        for(int j=0;j<NEURONS;j++) {
            outputs[j] = 0;
            for(int k=0,l=0;l<INPUTS;l++,k++) {
                outputs[j] += inputs[l] * weights[i][k];
            }
            outputs[j] = sigmoid(outputs[j]);
        }
        std::swap(inputs, outputs);
    }
+2

Compiling with -static and -pg, running the program, and then parsing gmon.out with gprof, I got:

45.65% gomp_barrier_wait_end

That is a lot of time spent in the OpenMP barrier routine, i.e. time spent waiting for the rest of the threads to finish. Since you execute the parallel for loop many times (once per layer, LAYERS times), you lose the advantage of parallelism: at the end of every parallel for loop there is an implicit barrier call that does not return until all the other threads have finished.

+2

First of all, run the test in the multi-threaded configuration and MAKE SURE that procexp or Task Manager shows you 100% CPU utilization. If it does not, then you are not actually using multiple threads or multiple processor cores.

Also taken from the wiki:

Environment Variables

A method to alter the execution features of OpenMP applications. Used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS is used to specify the number of threads for an application.

0

I don't see where you actually use OpenMP; try #pragma omp parallel for over the main loop (documented here, for example).

The slowdown may come from OpenMP thread startup and initialization, code bloat, or other changes to compilation caused by the compiler flags you added to enable it. Alternatively, the loops are so small and simple that the threading overhead far exceeds any performance gain.

0
