UPD 02/12/2015. I did some experiments.
Theory
The obvious solution is to switch the "manager" threads to the RT scheduler (the real-time scheduler, which provides the SCHED_DEADLINE / SCHED_FIFO policies). In this case, the "manager" threads will always have a higher priority than most threads in the system, so they will almost always get a CPU when they need it.
However, there is another solution that lets you stay on the CFS scheduler. Your description of the purpose of the "worker" threads resembles batch scheduling (in ancient times, when computers were large, a user had to put a job into a queue and wait hours for it to complete). Linux CFS supports batch jobs through the SCHED_BATCH policy and interactive jobs through the SCHED_NORMAL policy.
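As a rough illustration of what switching policies looks like from user space, here is a minimal Python sketch built on the standard os module (Linux only). The helper name set_policy is mine and is not part of TSLoad or the kernel, and the FIFO priority 9 is only an example value. SCHED_FIFO normally requires root or CAP_SYS_NICE, so the helper reports failure instead of raising.

```python
import os

def set_policy(pid, policy_name, priority=0):
    """Try to switch `pid` (0 = the caller) to the named scheduling policy.

    `policy_name` is an attribute name of the os module, such as
    "SCHED_BATCH" or "SCHED_FIFO". Returns True on success and False if
    the platform lacks the call or the kernel refuses the request.
    """
    policy = getattr(os, policy_name, None)
    if policy is None or not hasattr(os, "sched_setscheduler"):
        return False
    try:
        os.sched_setscheduler(pid, policy, os.sched_param(priority))
        return True
    except OSError:  # includes PermissionError for unprivileged callers
        return False

if __name__ == "__main__":
    # Worker side: SCHED_BATCH uses static priority 0.
    print("batch:", set_policy(0, "SCHED_BATCH"))
    # Manager side: SCHED_FIFO with an example priority of 9.
    print("fifo:", set_policy(0, "SCHED_FIFO", priority=9))
```

In a real program you would call this from each worker or manager thread right after it starts, since the policy applies to the calling thread.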
There is a useful comment in the kernel code (kernel/sched/fair.c):
    /*
     * Batch and idle tasks do not preempt non-idle tasks (their preemption
     * is driven by the tick):
     */
    if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
        return;
Therefore, when a "manager" thread or some other event wakes up a "worker", the worker gets a CPU only if the system has idle CPUs or when the "manager" exhausts its timeslice (to tune the timeslice, change the weight of the task).
It seems that your problem cannot be solved without changing scheduler policies. If the "worker" threads are very busy and the "manager" threads rarely wake up, they will receive the same vruntime bonus, so a "worker" will always preempt the "manager" threads (but you can change their weight so that they exhaust their bonus faster).
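To make the link between weight and vruntime more concrete, here is a simplified model in Python. It assumes the usual CFS rule that vruntime advances by real execution time scaled by NICE_0_LOAD / weight, and it uses only three entries of the kernel's nice-to-weight table. The function name is mine, and the real kernel uses fixed-point arithmetic, so treat the numbers as approximate.

```python
# Three entries from the kernel's nice-to-weight table (nice -> load weight).
NICE_TO_WEIGHT = {0: 1024, 10: 110, 19: 15}
NICE_0_LOAD = 1024  # weight of a nice-0 task

def vruntime_delta(exec_ns, nice):
    """Approximate vruntime charged for `exec_ns` of real CPU time."""
    return exec_ns * NICE_0_LOAD // NICE_TO_WEIGHT[nice]

if __name__ == "__main__":
    for nice in (0, 10, 19):
        # A lower weight (higher nice) makes vruntime advance faster.
        print(nice, vruntime_delta(1_000_000, nice))
```

In this model a task with a lower weight burns through any vruntime advantage sooner, which is why the experiments combine SCHED_BATCH with nice.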
Experiment
I have a server with 2 x Intel Xeon E5-2420 CPUs, which gives us 24 hardware threads. To simulate the two thread pools, I used my own TSLoad workload generator (and fixed a couple of bugs while running the experiments :)).
There were two thread pools: tp_manager with 4 threads and tp_worker with 30 threads. Both run busy_wait workloads (just for(i = 0; i < N; ++i); ), but with different numbers of loop cycles. tp_worker runs in benchmark mode, so it runs as many requests as it can and occupies 100% of the CPU.
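For readers who want to reproduce the idea without TSLoad, a busy_wait request and its service time can be approximated like this. This is only a sketch in Python and not how TSLoad implements it; the function names are mine.

```python
import time

def busy_wait(n):
    """Spin for n loop iterations, like `for(i = 0; i < N; ++i);`."""
    i = 0
    while i < n:
        i += 1
    return i

def service_time_ms(n):
    """Wall-clock duration of one busy_wait request, in milliseconds."""
    start = time.perf_counter()
    busy_wait(n)
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    print("service time, ms:", service_time_ms(100_000))
```

Because the measurement is wall-clock time, it also includes any time the thread spends preempted or slowed down by a sibling hardware thread.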
Here is a config example: https://gist.github.com/myaut/ad946e89cb56b0d4acde
3.12 (vanilla with debug configuration)
EXP | MANAGER                      | WORKER
    | sched           wait service | sched           service
    | policy          time    time | policy             time
 33 | NORMAL         0.045   2.620 | WAS NOT RUNNING
 34 | NORMAL         0.131   4.007 | NORMAL          125.192
 35 | NORMAL         0.123   4.007 | BATCH           125.143
 36 | NORMAL         0.026   4.007 | BATCH (nice=10) 125.296
 37 | NORMAL         0.025   3.978 | BATCH (nice=19) 125.223
 38 | FIFO (prio=9) -0.022   3.991 | NORMAL          125.187
 39 | core:0:0       0.037   2.929 | !core:0:0       136.719
3.2 (Debian stock)
EXP | MANAGER                      | WORKER
    | sched           wait service | sched           service
    | policy          time    time | policy             time
 46 | NORMAL         0.032   2.589 | WAS NOT RUNNING
 45 | NORMAL         0.081   4.001 | NORMAL          125.140
 47 | NORMAL         0.048   3.998 | BATCH           125.205
 50 | NORMAL         0.023   3.994 | BATCH (nice=10) 125.202
 48 | NORMAL         0.033   3.996 | BATCH (nice=19) 125.223
 42 | FIFO (prio=9) -0.008   4.016 | NORMAL          125.110
 39 | core:0:0       0.035   2.930 | !core:0:0       135.990
Some notes:
- All times are in milliseconds.
- The last experiment sets CPU affinity (suggested by @PhilippClaßen): manager threads are bound to Core #0, while worker threads are bound to all cores except Core #0.
- Service time for manager threads grows by about half (from roughly 2.6 to roughly 4.0) once the workers are running, which is explained by contention inside cores (the processor has Hyper-Threading!).
- Using SCHED_BATCH + nice (TSLoad cannot set the weight directly, but nice can do it indirectly) slightly reduces the manager wait time.
- The negative wait time in the SCHED_FIFO experiment is OK: TSLoad reserves 30us so that it can do preliminary work, the scheduler has time to make a context switch, etc. SCHED_FIFO seems to be very fast.
- Reserving a single core is not that bad, and because it eliminates in-core contention, manager service time drops significantly.
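The affinity setup from the last experiment can be sketched outside TSLoad with the standard os module (Linux only). The helper names are mine, and the sketch works on logical CPU numbers. How logical CPUs map to physical cores and Hyper-Threading siblings is system-specific, so reserving logical CPU 0 alone is an assumption made for illustration.

```python
import os

def split_cpus(all_cpus, reserved):
    """Split CPUs into (manager_cpus, worker_cpus).

    Managers get the reserved CPUs that actually exist; workers get
    everything else.
    """
    all_cpus = set(all_cpus)
    manager_cpus = all_cpus & set(reserved)
    worker_cpus = all_cpus - manager_cpus
    return manager_cpus, worker_cpus

def pin(pid, cpus):
    """Bind `pid` (0 = the caller) to `cpus`; return True on success."""
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(pid, cpus)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    available = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else {0}
    manager_cpus, worker_cpus = split_cpus(available, {0})
    print("manager CPUs:", sorted(manager_cpus))
    print("worker CPUs:", sorted(worker_cpus))
    # A manager thread would call pin(0, manager_cpus),
    # a worker thread would call pin(0, worker_cpus).
```

To keep a whole physical core free for managers on a Hyper-Threading machine, the reserved set would need to contain every logical CPU that belongs to that core.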