Linux 2.6.31 Scheduler and multithreaded jobs

I run massively parallel scientific jobs on a shared Linux machine with 24 cores. Most of the time, my jobs scale to all 24 cores when nothing else is running on the machine. However, when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I run at high nice values) only achieve ~1800% CPU (in Linux notation). Meanwhile, about 500% of the CPU cycles (again, in Linux notation) sit idle. Can someone explain this behavior, and what can I do about it to get all 23 cores that no one else is using?

Notes:

  • In case it matters, I have observed this on several different kernel versions, though I can't remember which ones off the top of my head.

  • The processor architecture is x86-64. Could it matter that my 24-thread jobs are 32-bit while the jobs I'm competing with are 64-bit?

Edit: One thing I just noticed is that going up to 30 threads alleviates the problem to some extent, bringing me up to ~2100% CPU.

+6
performance multithreading linux linux-kernel scheduler
5 answers

Perhaps this is because the scheduler tries to keep each of your tasks on the same processor it previously ran on (it does this because the task has most likely brought its working set into that processor's cache - it is "cache hot").

Here are some ideas you can try:

  • Run twice as many threads as you have cores;
  • Run one or two fewer threads than you have cores;
  • Reduce the value of /proc/sys/kernel/sched_migration_cost (possibly to zero);
  • Decrease the value of /proc/sys/kernel/sched_domain/.../imbalance_pct closer to 100.
+6

Do you need to synchronize your threads? If so, you may experience the following problem:

Suppose you have a 4-processor system and a 4-thread job. When it runs alone, the threads spread out over all 4 cores and overall usage is nearly perfect (call it 400%).

If you add one single-threaded interfering task, the scheduler might place 2 of your threads on the same processor. That means 2 of your threads are now effectively running at half their normal pace (a drastic simplification), and if your threads need to synchronize periodically, your job's progress is limited by the slowest thread, which in this case runs at half the normal speed. You would see only 200% utilization (4 threads at 50% of work each) plus 100% (the interfering job) = 300%.

Similarly, if you assume the interfering task uses only 25% of one processor's time, you might see one of your threads sharing a CPU with the interferer. In that case, the slowest thread runs at 3/4 normal speed, giving a total utilization of 300% (4 x 75%) + 25% = 325%. Play with these numbers and it's easy to produce something similar to what you are seeing.

If this is the problem, you can of course play with priorities to give the interfering tasks only tiny slices of the available CPU (I'm assuming I/O delays aren't a factor). Or, as you have found, try increasing the number of threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. Thus a 24-core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).

+2

Do your threads interact with each other?

Try manually binding each thread to a CPU using sched_setaffinity or pthread_setaffinity_np . The scheduler can be pretty dumb when dealing with a large number of busy threads.

+1

It might be worth using mpstat (part of sysstat ) to find out whether some of your CPUs sit entirely idle while others are fully utilized. It should give you a more detailed view of usage than top or vmstat: run mpstat -P ALL to see one line per CPU.

As an experiment, you could set the CPU affinity of each thread so that each is bound to its own separate CPU; this would show you what performance looks like when you don't let the kernel scheduler decide which CPU a task runs on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.

0

Do you think the bottleneck is in your application or in the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multithreaded application to see whether it exhibits the same behavior as yours.

    // COMPILE WITH: gcc threads.c -lpthread -o thread
    #include <pthread.h>

    #define NUM_CORES 24

    void* loop_forever(void* argument)
    {
        volatile int a = 0;   /* volatile keeps the busy loop from being optimized away */
        while (1)
            a++;
        return 0;
    }

    int main(void)
    {
        int i;
        pthread_t threads[NUM_CORES];

        for (i = 0; i < NUM_CORES; i++)
            pthread_create(&threads[i], 0, loop_forever, 0);
        for (i = 0; i < NUM_CORES; i++)
            pthread_join(threads[i], 0);
        return 0;
    }
0
