How to estimate the thread context switching overhead?

I am trying to improve the performance of a threaded application with real-time deadlines. It runs on Windows Mobile and is written in C / C++. I suspect that the high frequency of thread switching might be causing tangible overhead, but I can neither prove nor disprove it. As you know, absence of evidence is not evidence of the contrary :).

So my questions are:

  • If they exist at all, where can I find actual measurements of the cost of a thread context switch?

  • Without spending time writing a test application, what are the ways to estimate the thread-switching overhead in an existing application?

  • Does anyone know a way to find out the number of context switches (on / off) for a given thread?

+52
c++ c multithreading windows-mobile
Nov 20 '08 at 9:21
8 answers

While you said you don't want to write a test application, I did this for a previous test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that would boost::thread::yield() (or similar) and increment some variable, and after a minute or so (with no other running processes, at least none that do anything), the app printed how many context switches it could do per second. Of course, this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it just didn't make sense anymore to think about the overhead. So, just go ahead and simply write a test, instead of thinking too much about a problem that may not even exist.
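
For illustration, here is a minimal sketch of that kind of test in portable C++11. It is my own reconstruction, not the original ARM9 code, and note that std::this_thread::yield() is only a hint to the scheduler, so it approximates rather than guarantees one context switch per iteration:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> stop{false};
    std::atomic<unsigned long> yields{0};

    void worker() {
        while (!stop.load(std::memory_order_relaxed)) {
            yields.fetch_add(1, std::memory_order_relaxed);
            std::this_thread::yield(); // hand the CPU over to the other thread
        }
    }

    int main() {
        std::thread t1(worker), t2(worker);
        std::this_thread::sleep_for(std::chrono::seconds(10));
        stop = true;
        t1.join();
        t2.join();
        std::printf("~%lu yields in 10 s (~%lu per second)\n",
                    yields.load(), yields.load() / 10);
    }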

In addition, you could try 1800 INFORMATION's suggestion of using performance counters.

Oh, and I remember an application running on Windows CE 4.x where we also had four threads with intensive switching at times, and we never ran into performance issues. We also tried implementing the core processing without threads at all and saw no performance improvement (the GUI just responded much more slowly, but everything else was the same). Maybe you can try the same, by either reducing the number of context switches or by removing threads completely (just for testing).

+13
Nov 20 '08 at 9:37

I doubt you can find this overhead somewhere on the web for any existing platform. There are just too many different platforms. The overhead depends on two factors:

  • The CPU, since the necessary operations may be easier or harder on different CPU types
  • The system kernel, since different kernels will have to perform different operations on each switch

Other factors include how the switch takes place. A switch can take place when:

  • the thread has used up its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel, which will decide who's next.

  • the thread has been preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter which thread owns the CPU right now, when the user types something or clicks something, he doesn't want to wait till the current thread's time quantum has been used up completely; he wants the system to react immediately. Thus some systems will stop the current thread at once and return control to some other thread with a higher priority.

  • the thread doesn't need CPU time anymore, because it's blocking on some operation or has called sleep() (or similar) to stop running.

These three scenarios might have different thread-switching times in theory. E.g. I'd expect the last one to be the slowest, since a call to sleep() means the CPU is given back to the kernel, and the kernel needs to set up a wake-up call that makes sure the thread is woken up after roughly the amount of time it requested to sleep; it must then take the thread out of the scheduling process, and once the thread is woken up, it must add the thread to the scheduling process again. All these steps will take some amount of time, so the actual sleep call might take longer than the time needed to switch to another thread.
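
To get a feel for that extra cost, here is a small sketch (mine, not from this answer) that requests a 1 ms sleep repeatedly and reports how much longer the round trip through the kernel actually takes:

    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main() {
        using clk = std::chrono::steady_clock;
        const int runs = 100;
        long long total_us = 0, worst_us = 0;
        for (int i = 0; i < runs; ++i) {
            auto t0 = clk::now();
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          clk::now() - t0).count();
            long long overshoot = us - 1000; // anything beyond the requested 1 ms
            total_us += overshoot;
            if (overshoot > worst_us) worst_us = overshoot;
        }
        std::printf("avg overshoot: %lld us, worst: %lld us\n",
                    total_us / runs, worst_us);
    }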

I guess if you want to know for sure, you need to benchmark. The problem is that you usually have to either put threads to sleep or synchronize them using mutexes, and sleeping or locking/unlocking mutexes is itself an overhead. This means your benchmark will include that overhead as well. Without a powerful profiler, it's hard to say afterwards how much CPU time was used for the actual switch and how much for the sleep/mutex calls.

On the other hand, in a real-life scenario your threads will also either sleep or synchronize via locks. A benchmark that purely measures the context switch time is a synthetic benchmark, since it doesn't model any real-life scenario; benchmarks are much more "realistic" if they are based on real-life scenarios. Of what use is a GPU benchmark telling me my GPU can in theory handle 2 billion polygons a second, if that result can never be achieved in a real 3D application? Wouldn't it be much more interesting to know how many polygons a real 3D application can have the GPU handle per second?

Unfortunately, I know nothing about Windows programming. I could write an application for Windows in Java, or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you source code for POSIX.

    #include <stdlib.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <pthread.h>
    #include <sys/time.h>
    #include <unistd.h>

    uint32_t COUNTER;
    pthread_mutex_t LOCK;
    pthread_mutex_t START;
    pthread_cond_t CONDITION;

    void * threads ( void * unused ) {
        // Wait till we may fire away
        pthread_mutex_lock(&START);
        pthread_mutex_unlock(&START);

        pthread_mutex_lock(&LOCK);
        // If I'm not the first thread, the other thread is already waiting on
        // the condition, thus I have to wake it up first, otherwise we'll deadlock
        if (COUNTER > 0) {
            pthread_cond_signal(&CONDITION);
        }
        for (;;) {
            COUNTER++;
            pthread_cond_wait(&CONDITION, &LOCK);
            // Always wake up the other thread before processing. The other
            // thread will not be able to do anything as long as I don't go
            // back to sleep first.
            pthread_cond_signal(&CONDITION);
        }
        pthread_mutex_unlock(&LOCK); // To unlock (never reached)
    }

    int64_t timeInMS () {
        struct timeval t;
        gettimeofday(&t, NULL);
        return (int64_t)t.tv_sec * 1000 + (int64_t)t.tv_usec / 1000;
    }

    int main ( int argc, char ** argv ) {
        pthread_t t1;
        pthread_t t2;
        int64_t myTime;

        pthread_mutex_init(&LOCK, NULL);
        pthread_mutex_init(&START, NULL);
        pthread_cond_init(&CONDITION, NULL);

        pthread_mutex_lock(&START);
        COUNTER = 0;
        pthread_create(&t1, NULL, threads, NULL);
        pthread_create(&t2, NULL, threads, NULL);
        pthread_detach(t1);
        pthread_detach(t2);

        // Get start time and fire away
        myTime = timeInMS();
        pthread_mutex_unlock(&START);

        // Wait for about a second
        sleep(1);

        // Stop both threads
        pthread_mutex_lock(&LOCK);

        // Find out how much time has really passed. sleep won't guarantee me that
        // I sleep exactly one second, I might sleep longer since even after being
        // woken up, it can take some time before I gain back CPU time. Further
        // some more time might have passed before I obtained the lock!
        myTime = timeInMS() - myTime;

        // Correct the number of thread switches accordingly
        COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);

        printf("Number of thread switches in about one second was %u\n", COUNTER);
        return 0;
    }
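
For what it's worth, on a desktop POSIX system the above should build with something like gcc -O2 ctxswitch.c -o ctxswitch -lpthread (assuming the source is saved as ctxswitch.c; the file name is of course arbitrary).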

Output

 Number of thread switches in about one second was 108406 

More than 100,000 per second is not that bad, and that's even though we have locking and condition waits. I'd guess that without all this locking, at least twice as many thread switches per second would be possible.

+24
Nov 20 '08 at 10:46

You can't estimate it. You need to measure it. And it's going to vary depending on the processor in the device.

There are two fairly simple ways to measure a context switch. One involves code, the other doesn't.

First, the code way (pseudocode):

    DWORD tick;

    main() {
        HANDLE hThread = CreateThread(..., ThreadProc, CREATE_SUSPENDED, ...);
        tick = QueryPerformanceCounter();
        CeSetThreadPriority(hThread, 10); // real high
        ResumeThread(hThread);
        Sleep(10);
    }

    ThreadProc() {
        tick = QueryPerformanceCounter() - tick;
        RETAILMSG(TRUE, (_T("ET: %i\r\n"), tick));
    }

Obviously, doing it in a loop and averaging will be better. Keep in mind that this isn't measuring just the context switch: you're also measuring the call to ResumeThread, and there's no guarantee that the scheduler will immediately switch to your other thread (although priority 10 should help increase the odds that it will).
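
Note that QueryPerformanceCounter returns raw ticks; to report the result in microseconds you divide by QueryPerformanceFrequency. A self-contained sketch of that conversion (my own, for desktop Win32, though the same pair of calls exists on CE; Sleep(10) merely stands in for the section being measured):

    #include <stdio.h>
    #include <windows.h>

    int main(void) {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq); // ticks per second
        QueryPerformanceCounter(&t0);
        Sleep(10);                        // stand-in for the measured section
        QueryPerformanceCounter(&t1);
        double us = (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;
        printf("elapsed: %.1f us\n", us);
        return 0;
    }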

You can get a more accurate measurement with CeLog by hooking into scheduler events, but that is far from simple to do and not very well documented. If you really want to go that route, Sue Loh has several blog posts on it that a search engine can find.

The non-code route would be to use Remote Kernel Tracker. Install eVC 4.0 or the eval version of Platform Builder to get it. It will give a graphical display of everything the kernel is doing, and you can directly measure a thread context switch with the cursor capabilities provided. Again, I'm sure Sue has a blog post on using Kernel Tracker as well.

All that said, you'll find that in-process thread context switches are really, really fast. It's process switches that are expensive, as they require swapping the active process in RAM and the remapping that follows.

+14
Nov 20 '08 at 14:07

My 50 lines of C++ show, on Linux (QuadCore Q6600), a context switch time of ~0.9 µs (0.75 µs for 2 threads, 0.95 µs for 50 threads). In this benchmark, threads call yield immediately when they get a quantum of time.

+7
May 15 '10 at 14:50

I only ever tried to estimate this once, and that was on a 486! The upshot was that the processor context switch took about 70 instructions to complete (note this happened for many OS API calls as well as thread switching). We calculated that it took approximately 30 µs per thread switch (including OS overhead) on a DX3. The few thousand context switches we were doing per second absorbed between 5-10% of the processor time.
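
(As a rough sanity check of those numbers: at 30 µs per switch, 3,000 switches a second would be 3,000 × 30 µs = 90 ms out of every second, i.e. about 9% of the CPU, which is consistent with the 5-10% figure.)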

How this translates to a multi-core, multi-GHz processor I don't know, but my guess would be that unless you completely go overboard with thread switching, it's a negligible overhead.

Note that creating/deleting a thread is a more expensive CPU/OS hog than activating/deactivating one. A good policy for heavily threaded apps is to use thread pools and activate/deactivate threads as required.
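
As a minimal sketch of that policy (illustrative only, not a production implementation): workers are created once and then sleep on a condition variable until work arrives, so no threads are created or destroyed per task.

    #include <condition_variable>
    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
    public:
        explicit ThreadPool(size_t n) {
            for (size_t i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~ThreadPool() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto &w : workers_) w.join();
        }
        void submit(std::function<void()> job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one(); // wake one sleeping worker
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return; // drained, shut down
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();
            }
        }
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> jobs_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    int main() {
        ThreadPool pool(4);
        for (int i = 0; i < 8; ++i)
            pool.submit([i] { std::printf("job %d\n", i); });
        // pool destructor drains remaining jobs and joins the workers
    }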

+5
Nov 20 '08 at 12:16

Context switches are expensive; as a rule of thumb they cost around 30 µs of CPU overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
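
(To put that rule of thumb in perspective: if it held on your hardware, 10,000 switches per second would consume 10,000 × 30 µs = 300 ms of every second, i.e. roughly 30% of one core, while a few hundred switches per second would be negligible.)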

+5
Aug 19 '11 at 8:21

The problem with context switches is that they have a fixed time cost. GPUs implement a 1-cycle context switch between threads. The following, for example, cannot be threaded on CPUs:

    double *a;
    ...
    for (i = 0; i < 1000; i++) {
        a[i] = a[i] + a[i];
    }

because its execution time is much smaller than the cost of a context switch. On a Core i7 this code takes about 1 µs (depending on the compiler). So context switch time does matter, because it determines how small jobs can be threaded. I guess this also provides a method for effectively measuring the context switch: check how long the array (in the example above) has to be before two threads from a thread pool start showing a real advantage over a single one. That may easily become 100,000 elements, and so the effective context switch time would be somewhere in the 20 µs range within one application.
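
A rough sketch of that measurement (names and numbers are mine; it uses bare std::thread rather than a pool, so it includes thread creation as well as switching, and the crossover point, not the absolute numbers, is what matters):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void double_range(double *a, size_t lo, size_t hi) {
        for (size_t i = lo; i < hi; ++i) a[i] = a[i] + a[i];
    }

    int main() {
        using clk = std::chrono::steady_clock;
        for (size_t n : {size_t(1000), size_t(100000), size_t(10000000)}) {
            std::vector<double> a(n, 1.0);

            auto t0 = clk::now();
            double_range(a.data(), 0, n);
            auto single_us = std::chrono::duration_cast<std::chrono::microseconds>(
                                 clk::now() - t0).count();

            t0 = clk::now();
            std::thread t(double_range, a.data(), size_t(0), n / 2);
            double_range(a.data(), n / 2, n);  // this thread takes the other half
            t.join();
            auto dual_us = std::chrono::duration_cast<std::chrono::microseconds>(
                               clk::now() - t0).count();

            std::printf("n=%zu: one thread %lld us, two threads %lld us\n",
                        n, (long long)single_us, (long long)dual_us);
        }
    }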

All the synchronization overhead used by the thread pool has to be counted as part of the thread switch time, because that is what it all comes down to in the end.

Atmapuri

+3
Feb 28 '10 at 10:37

I don't know, but do you have the usual performance counters in Windows Mobile? You could look at things like context switches/sec. I don't know whether there is one that specifically measures context switch time, though.

+1
Nov 20 '08 at 9:27


