CPU usage not maxed out and high synchronization time in an async/await-based server application

I am currently running some performance tests on a server application I developed, which relies heavily on the C# 5 async/await constructs.

This is a console application, so there is no synchronization context, and no threads are explicitly created in the code. The application dequeues requests from an MSMQ queue as fast as it can (asynchronous dequeue loop), and processes each request before sending the processed request out through an HttpClient.

The async/await-based I/O consists of dequeuing from MSMQ, reading/writing data to a SQL Server database, and finally the HttpClient request sent at the end of the chain.

Currently, for my tests, the database is completely faked (results are returned directly via Task.FromResult), and the HttpClient is also faked (it awaits a random Task.Delay between 0 and 50 ms and returns a response), so the only real I/O is dequeuing from MSMQ.

I have already significantly increased the application throughput after noticing that a lot of time was spent in the GC; I used the CLR Profiler to find out where I could optimize things.

Now I'm trying to figure out whether I can increase the throughput further, and I think it is possible.

There are two things that I don't understand, and there may be an opportunity to improve performance behind them:

1) I have 4 CPU cores (in fact only 2 physical ones, on an i7 CPU), and when the application runs it uses at most 3 cores (in the VS2012 Concurrency Visualizer I can clearly see that only 3 cores are used, and in Windows perfmon I see CPU usage at ~75-80%). Any idea why? I do not control the threads, because I do not create them explicitly and rely only on tasks, so why does the task scheduler not maximize CPU usage in my case? Has anyone experienced this?

2) Using the VS2012 Concurrency Visualizer, I see a very high synchronization time (approximately 20% execution and 80% synchronization). FYI, about 15 threads are created.

Approximately 60% of the synchronization comes from the following call stack:

    clr.dll!ThreadPoolMgr::WorkerThreadStart
    clr.dll!CLRSemaphore::Wait
    kernelbase.dll!WaitForSingleObjectEx

and

    clr.dll!ThreadPoolMgr::WorkerThreadStart
    clr.dll!ThreadPoolMgr::UnfairSemaphore::Wait
    clr.dll!CLRSemaphore::Wait
    kernelbase.dll!WaitForSingleObjectEx

And about 30% of the synchronization comes from:

    clr.dll!ThreadPoolMgr::CompletionPortThreadStart
    kernel32.dll!GetQueueCompletionStatusStub
    kernelbase.dll!GetQueuedCompletionStatus
    ntdll.dll!ZwRemoveIoCompletion
    ..... blablabla
    ntoskrnl.exe!KeRemoveQueueEx

I do not know whether it is normal to see such high synchronization.

EDIT: Based on Steven's answer, I am adding more details about my implementation:

Indeed, my server is completely asynchronous. However, some CPU work is done to process each message (not much, I admit, but still some). After a message is received from the MSMQ queue, it is first deserialized (most of the CPU/memory cost seems to occur at this point), then it goes through various processing/validation stages that cost some CPU, before finally reaching the "end" of the pipeline, where the processed message is sent to the outside world through an HttpClient.

My implementation does not wait for a message to be completely processed before dequeuing the next one from the queue. Indeed, my message pump, which dequeues messages from the queue, is very simple and immediately forwards each message so that it can dequeue the next one. The simplified code looks like this (exception handling, cancellation, etc. omitted):

    while (true)
    {
        var message = await this.queue.ReceiveNextMessageAsync();

        // Fire and forget: the pump does not wait for the message to be processed.
        this.DeserializeDispatchMessageAsync(message);
    }

    private async void DeserializeDispatchMessageAsync(Message message)
    {
        // Immediately yield to avoid blocking the asynchronous messaging pump
        // while deserializing the body which would otherwise impact the throughput.
        await Task.Yield();

        this.messageDispatcher.DispatchAsync(message).ForgetSafely();
    }

ReceiveNextMessageAsync is a custom method built on TaskCompletionSource, since the .NET MessageQueue does not offer any async method in .NET Framework 4.5. So I just use the BeginReceive / EndReceive pair with a TaskCompletionSource.
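As an illustration of that wrapping technique, here is a minimal sketch. It is not the actual ReceiveNextMessageAsync: to stay self-contained it uses Stream.BeginRead / EndRead as a stand-in for MessageQueue.BeginReceive / EndReceive, and the names ApmWrappers and ReadWithTcsAsync are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class ApmWrappers
{
    // Wraps a Begin/End (APM) pair in a Task using TaskCompletionSource.
    // Assumption: Stream.BeginRead/EndRead stands in for
    // MessageQueue.BeginReceive/EndReceive, which follow the same pattern.
    public static Task<int> ReadWithTcsAsync(Stream stream, byte[] buffer)
    {
        var tcs = new TaskCompletionSource<int>();
        try
        {
            stream.BeginRead(buffer, 0, buffer.Length, asyncResult =>
            {
                try
                {
                    // EndRead returns the result or rethrows the I/O failure.
                    tcs.TrySetResult(stream.EndRead(asyncResult));
                }
                catch (Exception ex)
                {
                    tcs.TrySetException(ex);
                }
            }, null);
        }
        catch (Exception ex)
        {
            // BeginRead itself may throw, e.g. on a disposed stream.
            tcs.TrySetException(ex);
        }
        return tcs.Task;
    }
}
```

The returned task completes when the callback fires, so no thread is blocked while the operation is pending.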

This is one of the only places in my code where I do not await an async method. The loop dequeues as quickly as it can. It does not even wait for message deserialization (deserialization is done lazily by the .NET FCL Message implementation when the Body property is first accessed). I immediately call Task.Yield() to offload the deserialization / message processing to another task and free the loop right away.

Right now, in the context of my benchmarks, as I said, all the I/O (database access only) is faked. All calls to async methods that retrieve data from the database simply return Task.FromResult with fake data. There are about 20 database calls during the processing of a message, and all of them are now faked / synchronous. The only asynchronous point is at the end of message processing, where the message is sent through HttpClient. The HttpClient send is also faked, but I do a random (0-50 ms) "await Task.Delay" at this point. In any case, because the database is faked, the processing of each message can be considered as a single task.

For my benchmarks, I store about 300K messages in the queue and then launch the server application. It dequeues them quite fast, flooding the server application, and all messages are processed concurrently. That's why I don't understand why I don't reach 100% CPU on 4 cores, but only 75% with 3 cores used (synchronization concerns aside).

When I only dequeue, without deserializing or processing the messages (by commenting out the DeserializeDispatchMessageAsync call), I reach a throughput of about 20K messages / sec. When I do all the processing, I reach a throughput of about 10K messages / sec.

The fact that messages are dequeued quickly and that deserialization + message processing is done in a separate task makes me picture many tasks (one per message) queued on the task scheduler (the thread pool here, since there is no synchronization context). So I would expect the thread pool to dispatch all of these tasks across the maximum number of cores, with all 4 cores fully busy handling them, but that doesn't seem to be the case.
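To make the faking approach concrete, here is a minimal sketch of what such fakes might look like. The class and method names (BenchmarkFakes, GetCustomerNameAsync, SendAsync) and the returned values are hypothetical, not taken from my actual code.

```csharp
using System;
using System.Threading.Tasks;

public static class BenchmarkFakes
{
    private static readonly Random Rng = new Random();

    // Fake database call: returns an already-completed task, so awaiting it
    // continues synchronously without any thread switch.
    public static Task<string> GetCustomerNameAsync(int id)
    {
        return Task.FromResult("customer-" + id);
    }

    // Fake HTTP send: the only truly asynchronous point of the pipeline.
    public static async Task<int> SendAsync(string payload)
    {
        int delayMs;
        lock (Rng) // Random is not thread-safe
        {
            delayMs = Rng.Next(0, 51); // 0 to 50 ms inclusive
        }
        await Task.Delay(delayMs);
        return 200; // pretend HTTP status code
    }
}
```

Because the fake database tasks are already completed, each message effectively runs as one synchronous chunk of CPU work followed by a single timer-based wait.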

In any case, any answer is welcome, I am looking for any idea / advice.

1 answer

It looks like your server is almost completely asynchronous (async MSMQ, async DB, async HttpClient). Therefore, in this case, I do not find your results unexpected.

Firstly, there is very little CPU work. I would fully expect each of the thread pool threads to sit idle most of the time, waiting for work to do. Remember that no CPU is used during a naturally asynchronous operation.

The Task returned by an asynchronous MSMQ / DB / HttpClient operation does not execute on a thread pool thread; it simply represents the completion of an I/O operation. The only thread pool work you should see is the short amount of synchronous work inside the asynchronous methods, which usually just marshals the buffers for I/O.

As for throughput, you do have some room to scale (assuming your test was flooding your existing service). It may be that your code just (asynchronously) retrieves a single value from MSMQ and then (asynchronously) processes it before reading another value; in that case, you would definitely see an improvement from reading continuously from MSMQ. Remember that async code is asynchronous, but it is still serialized; your async method may pause at any await.

If that is the case, you may benefit from setting up a TPL Dataflow pipeline (with MaxDegreeOfParallelism set to Unbounded) and running a tight loop that asynchronously reads from MSMQ and feeds the data into the pipeline. That would be easier than doing your own overlapping processing.
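A minimal sketch of such a pipeline, under some assumptions: a plain sequence of strings stands in for the asynchronous MSMQ reads, Task.Delay stands in for the per-message processing, and the names DataflowPump and ProcessAllAsync are made up for illustration. It needs the System.Threading.Tasks.Dataflow library.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowPump
{
    // Feeds every message into an ActionBlock that processes messages
    // with unbounded parallelism, then waits for the block to drain.
    public static async Task<int> ProcessAllAsync(IEnumerable<string> messages)
    {
        int processed = 0;

        var worker = new ActionBlock<string>(async message =>
        {
            await Task.Delay(1); // stand-in for deserialize + process + send
            Interlocked.Increment(ref processed);
        },
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
        });

        // The "tight loop": in a real server each item would come from an
        // awaited queue read instead of an in-memory sequence.
        foreach (var message in messages)
        {
            await worker.SendAsync(message);
        }

        worker.Complete();       // no more input
        await worker.Completion; // wait until every message has been handled
        return processed;
    }
}
```

The block owns the scheduling, so the reading loop never has to track in-flight messages itself.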

Update for edit:

I have some suggestions:

  • Use Task.Run instead of await Task.Yield. Task.Run has a clearer intent.
  • Your Begin/End wrappers can use Task.Factory.FromAsync instead of a TCS, which gives you cleaner code.

But I don't see any reason why the last core wouldn't be used, barring obvious reasons such as the profiler or another application keeping it busy. What you end up with is the async equivalent of dynamic parallelism, which is one of the situations the .NET thread pool was specifically designed to handle.
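
Both suggestions above could look roughly like this. It is only a sketch: Stream.BeginRead / EndRead stands in for the MessageQueue Begin/End pair, and the names Suggestions, ReadWithFromAsync and DispatchOnThreadPool are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class Suggestions
{
    // FromAsync wraps a Begin/End pair in one call, replacing a
    // hand-written TaskCompletionSource wrapper.
    public static Task<int> ReadWithFromAsync(Stream stream, byte[] buffer)
    {
        return Task<int>.Factory.FromAsync(
            stream.BeginRead, stream.EndRead, buffer, 0, buffer.Length, null);
    }

    // Task.Run queues the delegate to the thread pool and returns a task
    // for the whole asynchronous operation, so the intent is explicit.
    public static Task DispatchOnThreadPool(Func<Task> dispatchAsync)
    {
        return Task.Run(dispatchAsync);
    }
}
```

Unlike an async void method that starts with await Task.Yield(), the Task.Run version hands back a Task that the caller can observe for completion or errors.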
