Event overhead

I have my own thread pool, which creates a number of threads, each of which waits on its own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.

The problem is this: I have about 1,000 loops of 10,000 iterations each. These loops must be executed sequentially, but I have 4 CPUs available. What I am trying to do is split the 10,000-iteration loops into 4 loops of 2,500 iterations, i.e. one per thread. But I have to wait for the 4 small loops to finish before moving on to the next "big" iteration. This means that I can't chain the jobs.

My problem is that using the thread pool with 4 threads is much slower than doing the work sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).

I am on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().

It seems that this whole event thing (along with the synchronization between threads using critical sections) is pretty expensive!

My question is this: is it normal for the use of events to take a lot of time? If so, is there another mechanism I could use that would be less expensive?

Here is some code to illustrate (some relevant parts copied from the thread pool class):

    // thread function
    unsigned __stdcall ThreadPool::threadFunction(void* params)
    {
        // some housekeeping
        HANDLE signals[2];
        signals[0] = waitSignal;
        signals[1] = endSignal;

        do
        {
            // wait for one of the signals
            waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);

            // try to get the next job's parameters
            if (tp->getNextJob(threadId, data))
            {
                // execute job
                void* output = jobFunc(data.params);

                // tell thread pool that we're done and collect output
                tp->collectOutput(data.ID, output);
            }

            tp->threadDone(threadId);
        }
        while (waitResult - WAIT_OBJECT_0 == 0);

        // if we reach this point, endSignal was sent, so we are done!
        return 0;
    }

    // create all threads
    for (int i = 0; i < nbThreads; ++i)
    {
        threadData data;
        unsigned int threadId = 0;
        char eventName[20];

        sprintf_s(eventName, 20, "WaitSignal_%d", i);

        data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
                                              this, CREATE_SUSPENDED, &threadId);
        data.threadId = threadId;
        data.busy = false;
        data.waitSignal = CreateEvent(NULL, true, false, eventName);

        this->threads[threadId] = data;

        // start thread
        ResumeThread(data.handle);
    }

    // add job
    void ThreadPool::addJob(int jobId, void* params)
    {
        // housekeeping
        EnterCriticalSection(&(this->mutex));

        // first, insert parameters in the list
        this->jobs.push_back(job);

        // then, find the first free thread and wake it
        for (it = this->threads.begin(); it != this->threads.end(); ++it)
        {
            thread = (threadData) it->second;

            if (!thread.busy)
            {
                this->threads[thread.threadId].busy = true;
                ++(this->nbActiveThreads);

                // wake the thread so that it gets the next params and runs them
                SetEvent(thread.waitSignal);
                break;
            }
        }

        LeaveCriticalSection(&(this->mutex));
    }
+6
c++ multithreading synchronization events overhead
9 answers

If you are just parallelizing loops and using VS 2008, I would suggest looking at OpenMP. If you are using the Visual Studio 2010 beta 1, I would suggest looking at the Parallel Patterns Library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you are attempting to do, with less code.
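
For illustration, a minimal sketch of the OpenMP route (assuming the iterations of one big loop are independent of each other; doWork() is a placeholder for the body of one iteration, and the code is compiled with /openmp):

    void doWork(int i); // placeholder for one iteration's work

    void runOneBigIteration(int n) // n = 10,000 in your case
    {
        // OpenMP splits the loop across the available cores; the implicit
        // barrier at the end means all chunks finish before we return, which
        // gives you the "wait for the 4 small loops" behavior for free
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            doWork(i);
    }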

As for your performance question, it really depends. You will need to look at how much work you are scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small, which is why I suggest using an existing implementation. You also need to make sure that you are not running in debug mode or under the debugger, that the tasks themselves are not blocking on a lock, I/O or memory allocation, and that you are not hitting false sharing. Each of these can destroy scalability.

I would suggest looking at this under a profiler, for example xperf, the F1 profiler in Visual Studio 2010 beta 1 (it has 2 new concurrency modes that help you see contention), or Intel's VTune.

You could also share the code that you run inside the tasks, so people can get a better idea of what you are doing, because the answer I always get first with performance issues is "it depends", and second, "have you profiled it?"

Good luck,

-Rick

+1

Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you are seeing.

One way to fix this is to batch multiple jobs into one: if you get a "small" job (however you evaluate such things), store it somewhere until you have enough small jobs together to make one of reasonable size, then send them all to a worker thread for processing.

Alternatively, instead of using signaling you could use a multiple-reader, single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it does not, it sleeps for a short period, then wakes up and tries again. This will lower your per-job overhead, but your threads will take up CPU even when there is no work to be done. It all depends on the exact nature of the problem.
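
A minimal sketch of that polling scheme, assuming a queue guarded by a critical section (Job and runJob() are placeholders, not your API):

    #include <windows.h>
    #include <deque>

    struct Job { void* params; };     // stand-in for one unit of work
    void runJob(const Job& job);      // placeholder for the actual work

    std::deque<Job>  g_queue;
    CRITICAL_SECTION g_queueLock;     // protects g_queue
    volatile LONG    g_shutdown = 0;  // set to 1 to stop the workers

    bool tryPop(Job& job)
    {
        EnterCriticalSection(&g_queueLock);
        bool found = !g_queue.empty();
        if (found)
        {
            job = g_queue.front();
            g_queue.pop_front();
        }
        LeaveCriticalSection(&g_queueLock);
        return found;
    }

    unsigned __stdcall pollingWorker(void*)
    {
        Job job;
        while (!g_shutdown)
        {
            if (tryPop(job))
                runJob(job);
            else
                Sleep(1);   // nothing to do: back off briefly (burns some CPU)
        }
        return 0;
    }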

+3

This looks like an example of the producer-consumer pattern, which can be implemented with two semaphores: one guarding the queue against overflow, the other signaling when the queue is no longer empty.

You can find some details here.
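
For reference, the classic two-semaphore version looks roughly like this on Win32 (a sketch under assumptions: a fixed-capacity queue of job IDs, with QUEUE_CAPACITY, produce() and consume() invented for the example):

    #include <windows.h>
    #include <deque>

    const LONG QUEUE_CAPACITY = 64;

    std::deque<int>  g_jobs;        // job IDs as a stand-in for real jobs
    CRITICAL_SECTION g_jobsLock;    // protects g_jobs
    HANDLE g_emptySlots;            // blocks the producer when the queue is full
    HANDLE g_fullSlots;             // blocks consumers when the queue is empty

    void initQueue()
    {
        InitializeCriticalSection(&g_jobsLock);
        g_emptySlots = CreateSemaphore(NULL, QUEUE_CAPACITY, QUEUE_CAPACITY, NULL);
        g_fullSlots  = CreateSemaphore(NULL, 0, QUEUE_CAPACITY, NULL);
    }

    void produce(int jobId)
    {
        WaitForSingleObject(g_emptySlots, INFINITE); // wait for a free slot
        EnterCriticalSection(&g_jobsLock);
        g_jobs.push_back(jobId);
        LeaveCriticalSection(&g_jobsLock);
        ReleaseSemaphore(g_fullSlots, 1, NULL);      // wake one consumer
    }

    int consume()
    {
        WaitForSingleObject(g_fullSlots, INFINITE);  // wait for a job
        EnterCriticalSection(&g_jobsLock);
        int jobId = g_jobs.front();
        g_jobs.pop_front();
        LeaveCriticalSection(&g_jobsLock);
        ReleaseSemaphore(g_emptySlots, 1, NULL);     // free the slot
        return jobId;
    }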

+3

Beware: you are still asking for the next job after endSignal has been raised.

    for (;;)
    {
        // wait for one of the signals
        waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);

        if (waitResult - WAIT_OBJECT_0 != 0)
            return 0;

        // ...
    }
+2

Context switching between threads can also be expensive. In some cases it is worth developing a framework that lets you process your jobs sequentially with one thread or with several threads. That way you can get the best of both worlds.

By the way, what exactly is your question? I will be able to give a more precise answer to a more precise question :)

EDIT:

In some cases, handling the events can consume more time than your processing itself, though it should not be that expensive unless your processing is really fast. In that case, switching between threads is expensive too, hence my first remark about doing things sequentially...

You should look for synchronization bottlenecks between the threads. You could start by tracing how long your threads spend waiting...

EDIT: After more hints ...

If I understood correctly, your problem is how to use all your computer cores or processors efficiently to parallelize, piece by piece, processing that is essentially sequential.

Suppose you have 4 cores and 10,000 loops to compute, as in your example (from the comments). You said that you need to wait for the 4 threads to finish before going on. Then you can simplify your synchronization process: you just need to give your four threads the nth, nth+1, nth+2, and nth+3 loops, then wait for all four threads to finish. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. For efficiency, you could also look at a Windows implementation. Your thread pool is not really suitable for this task: the search for an available thread inside a critical section is what is killing your CPU time, not the events themselves.
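
To make that fork-join shape concrete, here is a sketch in plain Win32 (it spawns fresh threads for each big iteration just to keep the example short; a real version would reuse its threads via a barrier, as suggested above; doWork() is a placeholder for one inner iteration):

    #include <windows.h>
    #include <process.h>

    void doWork(int i); // placeholder for one inner iteration

    struct ChunkParams { int begin; int end; };

    unsigned __stdcall chunkWorker(void* p)
    {
        ChunkParams* c = static_cast<ChunkParams*>(p);
        for (int i = c->begin; i < c->end; ++i)
            doWork(i);
        return 0;
    }

    void runOneBigIteration()
    {
        const int N = 10000, THREADS = 4, CHUNK = N / THREADS; // 2,500 each
        HANDLE      handles[THREADS];
        ChunkParams params[THREADS];

        for (int t = 0; t < THREADS; ++t)
        {
            params[t].begin = t * CHUNK;
            params[t].end   = (t + 1) * CHUNK;
            handles[t] = (HANDLE) _beginthreadex(NULL, 0, chunkWorker,
                                                 &params[t], 0, NULL);
        }

        // the rendezvous: block until all four chunks are done
        WaitForMultipleObjects(THREADS, handles, TRUE, INFINITE);

        for (int t = 0; t < THREADS; ++t)
            CloseHandle(handles[t]);
    }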

+1

It should not be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-running jobs or for those that use a lot of I/O instead of CPU resources. If you are CPU-bound when processing a job, make sure you have only one thread per CPU.
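
For reference, the CPU count is available at run time, so the pool size can be matched to it (a small sketch):

    #include <windows.h>

    unsigned int workerCount()
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        return si.dwNumberOfProcessors; // one worker thread per CPU
    }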

There may be other issues: for instance, how does getNextJob get its data to process? If there is a large amount of data copying, then you have increased your overhead significantly again.

I would optimize by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just once, to kick off the thread. I would also store the jobs in a queue and pass a pointer, reference, or iterator to them to the thread instead of copying the data.
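
A sketch of what that could look like (assumed names, not your actual API: one auto-reset event wakes a worker, which then drains the queue; each Job carries only a pointer to its data, so nothing large is copied):

    #include <windows.h>
    #include <deque>

    struct Job { void* (*func)(void*); void* params; int id; };

    struct Pool
    {
        std::deque<Job>  jobs;
        CRITICAL_SECTION lock;  // protects jobs
        HANDLE           wake;  // auto-reset: CreateEvent(NULL, FALSE, FALSE, NULL)
        volatile LONG    stop;  // set to 1, then signal wake once per thread
    };

    static bool popJob(Pool& p, Job& out)
    {
        EnterCriticalSection(&p.lock);
        bool found = !p.jobs.empty();
        if (found)
        {
            out = p.jobs.front();   // Job only holds a pointer to the data
            p.jobs.pop_front();
        }
        LeaveCriticalSection(&p.lock);
        return found;
    }

    unsigned __stdcall worker(void* arg)
    {
        Pool& p = *static_cast<Pool*>(arg);
        Job job;
        for (;;)
        {
            WaitForSingleObject(p.wake, INFINITE); // one wake-up per batch
            if (p.stop)
                return 0;
            while (popJob(p, job))                 // drain until empty
                job.func(job.params);
        }
    }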

+1

It seems that this whole event thing (along with the synchronization between threads using critical sections) is pretty expensive!

"Expensive" is a relative term. Are jets expensive? Are cars? Or bicycles... shoes...?

In this case, the question is: are events "expensive" relative to the time jobFunc takes to execute? It would help to publish some absolute figures: how long does the process take when run without threads? Is it months, or a few femtoseconds?

What happens to the total time as the thread pool grows? Try a pool size of 1, then 2, then 4, and so on.

Also, since you have had some problems with thread pools in the past, I would suggest adding some debug logging to count how many times your thread function is actually invoked... does it match what you expect?

Picking a figure out of the air (not knowing anything about your target system, and assuming you are not doing anything "huge" in the code you have not shown), I would expect the "overhead" of each job to be measured in microseconds. Maybe a hundred or so. If the time taken to run the algorithm in jobFunc is not significantly more than this, then your threads will probably cost you time rather than save it.
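
A quick way to get those absolute figures on Windows (a sketch; runBatch() is a placeholder for whatever you want to time):

    #include <windows.h>
    #include <stdio.h>

    void runBatch(); // placeholder: the code under test

    void timeIt(const char* label)
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        runBatch();
        QueryPerformanceCounter(&t1);
        double microseconds = (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart;
        printf("%s: %.1f microseconds\n", label, microseconds);
    }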

+1

Since you say it is much slower in parallel than in sequential execution, I assume that the processing time of your internal 2,500-iteration loops is tiny (in the few-microsecond range). In that case there is not much you can do, except review your algorithm so that it can split off larger chunks of processing; OpenMP will not help, and neither will any other synchronization technique, because they all fundamentally rely on events (spin loops do not qualify).

On the other hand, if the processing time of the 2,500-iteration loops is more than 100 microseconds (on current PCs), you may be running into hardware limitations. If your processing uses a lot of memory bandwidth, splitting it across four processors will not give you more bandwidth; it will actually give you less because of collisions. You could also be running into cache cycling problems, where each of your top 1,000 iterations flushes and reloads the caches of the 4 cores. Then there is no single solution, and depending on your target hardware, there may be none at all.

+1

As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken by the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.

Your coding approach has increased the amount of overhead by actively looking for an idle thread to hand new work to. The operating system already keeps track of that and does it a lot more efficiently. Also, your ThreadPool::addJob() function may find that all threads are in use and be unable to delegate the work, yet it provides no return code related to that problem. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are always idle processors. I would suggest reorganizing the code so that addJob() does what it is named for: it adds a job ONLY (without finding, or even caring, who does the job), while each worker thread actively fetches new work when it is done with its existing work.

0
