Where is the balance between thread count and thread lock time?

Expanded question:
When you have more blocked threads than CPU cores, where is the balance between the number of threads and the amount of blocking that maximizes CPU efficiency by minimizing context-switch overhead?

I have many I/O devices that I need to control under Windows 7 on an x64 multi-core processor: PCI devices, network devices, files stored on hard drives, large chunks of data being copied. The usual policy is: "Throw a thread at it!" A few dozen threads later, this starts to feel like a bad idea.

None of my cores is anywhere near 100% busy, and several cores are essentially idle, yet I see delays in the 10-100 ms range that cannot be explained by I/O blocking or heavy CPU use. Other processes are not competing for resources either. I suspect context-switch overhead.

There are many possible solutions:

  • Reduce the thread count by combining threads that serve the same I/O device: this applies mainly to the hard drive, but possibly to the network as well. If I write 20 MB to my hard drive in one thread and 10 MB in another, is it better to issue both from the same thread? How does this change when multiple hard drives are involved?
  • Reduce the thread count by combining similar I/O devices, and raise the priority. Dozens of high-priority threads would likely make the UI thread stutter, but I could merge all of these functions into one or a few threads and raise their priority.
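A minimal, portable sketch of the first option: a single dedicated writer thread per physical disk, fed by a queue. The `WriteJob` and `DiskWriter` names are hypothetical, and the actual write is stubbed out with a byte counter; real code would call `WriteFile` or `fwrite` where the comment indicates.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical write job: which file, how many bytes.
struct WriteJob {
    std::string path;
    std::size_t bytes;
};

// One dedicated writer thread per physical disk: producers enqueue jobs,
// a single consumer performs them in order, so the disk sees one
// sequential request stream instead of several competing threads.
class DiskWriter {
public:
    DiskWriter() : worker_(&DiskWriter::run, this) {}

    void enqueue(WriteJob job) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(job));
        }
        cv_.notify_one();
    }

    // Signal completion, drain the queue, join, and report total bytes.
    std::size_t stop() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_one();
        if (worker_.joinable()) worker_.join();
        return bytes_written_;
    }

    ~DiskWriter() { stop(); }

private:
    void run() {
        for (;;) {
            WriteJob job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;  // done_ set and queue drained
                job = std::move(q_.front());
                q_.pop();
            }
            // Real code would call WriteFile()/fwrite() here; we just tally.
            bytes_written_ += job.bytes;
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<WriteJob> q_;
    bool done_ = false;
    std::size_t bytes_written_ = 0;
    std::thread worker_;  // declared last: starts running in the ctor
};

// The 20 MB / 10 MB scenario from the question, counted in megabytes.
std::size_t demo_total() {
    DiskWriter w;
    w.enqueue({"a.bin", 20});
    w.enqueue({"b.bin", 10});
    return w.stop();
}
```

Two producers can now call `enqueue` concurrently while the drive still receives one ordered stream of requests; one such writer per physical drive keeps the drives independent.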

Any case studies on solving this kind of problem would be greatly appreciated.

4 answers

First, it sounds like these tasks should be performed with asynchronous I/O (I/O completion ports, preferably) rather than with separate blocking threads. Dedicating a thread just to block on each operation is usually the wrong way to do I/O.
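Completion-port code is Win32-specific, but the underlying idea can be illustrated portably: issue every operation up front and overlap the waits, instead of parking one long-lived blocked thread per device. A hedged sketch with a simulated read (`fake_read` is a stand-in, not a real device API):

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Stand-in for a blocking device read: sleeps to simulate I/O latency
// and returns the number of bytes "read".
std::size_t fake_read(std::size_t bytes) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return bytes;
}

// Issue every read up front, then collect results as they complete,
// so the waits overlap instead of happening one after another.
std::size_t read_all_async(const std::vector<std::size_t>& sizes) {
    std::vector<std::future<std::size_t>> pending;
    pending.reserve(sizes.size());
    for (std::size_t s : sizes)
        pending.push_back(std::async(std::launch::async, fake_read, s));

    std::size_t total = 0;
    for (auto& f : pending)
        total += f.get();
    return total;
}
```

Note that `std::async` still spends a thread per operation under the hood; the real Windows mechanism (OVERLAPPED I/O with `CreateIoCompletionPort` / `GetQueuedCompletionStatus`) completes operations without tying up a thread per request, which is why the answer recommends it.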

Second, blocked threads should not add context-switching overhead. The scheduler has to juggle all runnable threads, so a large number of unblocked threads can slow context switching down. But as long as most of your threads are blocked, they should not affect the ones that are not.


10-100 ms with some cores idle: this is not context switching itself, since a switch is orders of magnitude faster than these delays, even counting the cache misses a switch causes.

Async I/O won't help here. The kernel thread pools that implement async I/O also need to be scheduled and switched, although this is cheaper than for user-space threads since fewer ring transitions are involved. I would certainly move to async I/O if CPU load became a problem, but it hasn't.

You are not short of CPU, so what is it? Plenty of suspects: short of RAM? Excessive paging can cause long delays. Where is your page file? I moved mine from drive C: to another fast SATA drive.

PCI bandwidth? Do you have several TV cards?

Disk controller flush activity: do you have an SSD that is approaching capacity? That is always good for unexplained pauses. I get odd pauses even though my 128 GB SSD is only 2/3 full.

I have never had a problem with context-switch time, and I have been writing multithreaded apps for decades. Windows schedules and dispatches ready threads onto cores quickly. A few dozen threads on their own (i.e. not all runnable!) are not remotely a problem. Looking at my Task Manager performance tab right now, I have 1213 threads and no performance problems at all, with CPU use around 6% (an app under test running in the background, BitTorrent, etc.). Firefox has 30 threads, VLC media player 27, my test app 23. No problem at all writing this post.

Given your 10-100 ms delays, I would be amazed if fiddling with thread priorities and/or changing how you parcel work out to threads gave any improvement: something else is stalling your system (you don't have any drivers in there that I coded, do you? :).

Does perfmon offer any hints?

Rgds, Martin


I don't think there is a definitive answer, and it probably depends on your OS; some handle this better than others. Still, delays in the 10-100 ms range are not due to context switching itself (although they may be due to the scheduling algorithm's behavior). My experience under Windows is that I/O is fairly inefficient: if you do I/O of any kind, you will block, and I/O by one process or thread will also block other processes or threads. (For example, on Windows it probably makes no sense to have more than one thread accessing a given hard drive. You cannot read or write multiple sectors at the same time, and I am fairly sure Windows does not optimize access the way some other systems do.)

Regarding your exact questions:

"If I write 20 MB to the hard drive in one thread and 10 MB in another, is it better to issue both from the same thread?": It depends on the OS. In general, separate threads should not cost you throughput or latency, and depending on other activity and on the OS, they may even help: if several disk requests are queued, for example, most operating systems reorder them to reduce head movement. The simplest approach is to try both and see which works best on your system.
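In the spirit of "try both and measure", here is a rough harness one could adapt. The `fake_write` stand-in just sleeps, so this only demonstrates the measurement structure and how the waits overlap; substitute real writes to get numbers that mean anything for your disk.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <utility>

// Stand-in for writing `chunks` chunks to disk: sleeps 1 ms per chunk.
void fake_write(std::size_t chunks) {
    for (std::size_t i = 0; i < chunks; ++i)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Run the two writes sequentially in one thread, then concurrently in
// two, and return (sequential_ms, threaded_ms) for comparison.
std::pair<long long, long long> compare(std::size_t a, std::size_t b) {
    using clock = std::chrono::steady_clock;
    auto ms = [](clock::duration d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };

    auto t0 = clock::now();
    fake_write(a);
    fake_write(b);
    auto sequential = ms(clock::now() - t0);

    auto t1 = clock::now();
    std::thread ta(fake_write, a);
    std::thread tb(fake_write, b);
    ta.join();
    tb.join();
    auto threaded = ms(clock::now() - t1);

    return {sequential, threaded};
}
```

With sleeping stand-ins the threaded version always wins because the waits overlap perfectly; against a real single spindle the ordering can easily reverse, which is exactly why measuring on your own system is the answer.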

"How does this change when multiple hard drives are involved?": The OS should be able to perform the I/O in parallel if the requests target different drives.

As for raising the priority of one or more threads, it is very OS-dependent, but probably worth a try. As long as the higher-priority threads do not use significant CPU time, they should not affect the UI; these threads are mostly blocked on I/O, remember.


OK, my Windows 7 machine is currently running 950 threads, so I do not think adding a few dozen more will matter. Still, you should definitely look at a thread pool or some other work-stealing arrangement for this; you should not create new threads just to block on them. Where Windows offers asynchronous I/O for the operation, use it.
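A minimal fixed-size pool along those lines (the `Pool` class is hypothetical; production code on Windows would more likely use the native thread pool API or completion ports):

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool: N workers share one task queue, so adding
// work never creates (and later blocks) a brand-new thread.
class Pool {
public:
    explicit Pool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~Pool() {  // drains the queue, then joins all workers
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ set and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::vector<std::thread> workers_;  // last: threads start in the ctor
};

// Size the pool to the core count and run 100 small tasks through it.
int demo_pool() {
    std::atomic<int> hits{0};
    {
        Pool pool(std::max(1u, std::thread::hardware_concurrency()));
        for (int i = 0; i < 100; ++i)
            pool.submit([&hits] { ++hits; });
    }  // Pool destructor runs any remaining tasks and joins
    return hits.load();
}
```

Sizing the pool near the core count keeps the number of runnable threads bounded no matter how much work is submitted, which is the point being made above.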

