I am currently benchmarking a server application I developed, which relies heavily on the C# 5 async/await constructs.
This is a console application, so there is no synchronization context, and no threads are created explicitly in the code. The application dequeues requests from an MSMQ queue as fast as it can (an asynchronous dequeue loop), processes each request, and then sends the processed request through HttpClient.
The async/await-based I/O consists of dequeuing from MSMQ, reading and writing data in a SQL Server database, and finally the HttpClient request sent at the end of the chain.
Currently, for my tests, the database is completely faked (the results are returned directly via Task.FromResult), and HttpClient is faked as well (it awaits a random Task.Delay of 0 to 50 ms and then returns a response), so the only real I/O is dequeuing from MSMQ.
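A minimal sketch of the two fakes described above, assuming hypothetical names `FakeDbQueryAsync` and `FakeHttpSendAsync` (they are not from the original code) and a current .NET console project (top-level statements, `Random.Shared`). The database fake returns an already-completed task via Task.FromResult; the HttpClient fake awaits a random Task.Delay of 0 to 50 ms.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical fake for a database read: returns an already-completed task,
// so awaiting it never waits and never switches threads.
static Task<string> FakeDbQueryAsync(int id) => Task.FromResult($"row-{id}");

// Hypothetical fake for the HttpClient send: waits a random 0-50 ms, then "responds".
static async Task<int> FakeHttpSendAsync(string payload)
{
    int delayMs = Random.Shared.Next(0, 51); // upper bound is exclusive: 0..50 ms
    await Task.Delay(delayMs);
    return 200; // pretend HTTP status code; the fake ignores the payload
}

var row = await FakeDbQueryAsync(42);
var status = await FakeHttpSendAsync(row);
Console.WriteLine($"{row} -> HTTP {status}"); // prints "row-42 -> HTTP 200"
```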
I had already significantly increased the application's throughput after noticing that a lot of time was spent in the GC: I used the CLR Profiler to find out where I could optimize things.
Now I'm trying to figure out whether I can increase throughput further, and I think it is possible.
There are two things I don't understand, and perhaps there is an opportunity to improve performance there:
1) I have 4 logical cores (in fact, only 2 physical cores on an i7), and once the application is running, it mostly uses only 3 of them (in the VS2012 Concurrency Visualizer I can clearly see that only 3 cores are used, and in Windows perfmon I see CPU usage at around 75 to 80%). Any idea why? I do not control the threads, because I never create them explicitly and rely only on tasks, so why doesn't the task scheduler maximize CPU usage in my case? Has anyone experienced this?
2) Using the VS2012 Concurrency Visualizer, I see a very high synchronization time (approximately 20% execution and 80% synchronization). FYI, about 15 threads are created.
Approximately 60% of the synchronization time comes from the following call stacks:
clr.dll!ThreadPoolMgr::WorkerThreadStart
clr.dll!CLRSemaphore::Wait
kernelbase.dll!WaitForSingleObjectEx
and
clr.dll!ThreadPoolMgr::WorkerThreadStart
clr.dll!ThreadPoolMgr::UnfairSemaphore::Wait
clr.dll!CLRSemaphore::Wait
kernelbase.dll!WaitForSingleObjectEx
And about 30% of the synchronization comes from:
clr.dll!ThreadPoolMgr::CompletionPortThreadStart
kernel32.dll!GetQueueCompletionStatusStub
kernelbase.dll!GetQueuedCompletionStatus
ntdll.dll!ZwRemoveIoCompletion
..... blablabla
ntoskrnl.exe!KeRemoveQueueEx
I do not know whether such a high synchronization time is normal or not.
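To see what the thread pool itself is working with, a small diagnostic sketch (not part of the original application; written for a current .NET console project) prints the logical processor count and the pool's worker and I/O completion thread limits using the standard System.Threading.ThreadPool APIs. The worker/IOCP split corresponds to the WorkerThreadStart and CompletionPortThreadStart frames in the stacks listed in point 2.

```csharp
using System;
using System.Threading;

// Logical processors visible to the process (the question reports 4 on a 2-core i7).
int cores = Environment.ProcessorCount;

// Worker threads run queued tasks; I/O completion port (IOCP) threads run I/O callbacks.
// The pool creates threads on demand up to the minimum; beyond it, the pool may
// add threads more gradually or wait for running work to finish.
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
ThreadPool.GetAvailableThreads(out int availableWorker, out int availableIo);

Console.WriteLine($"Logical processors:             {cores}");
Console.WriteLine($"Min worker / IOCP threads:       {minWorker} / {minIo}");
Console.WriteLine($"Max worker / IOCP threads:       {maxWorker} / {maxIo}");
Console.WriteLine($"Available worker / IOCP threads: {availableWorker} / {availableIo}");
```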
EDIT: Based on Steven's answer, here are more details about my implementation:
Indeed, my server is fully asynchronous. However, some CPU work is done to process each message (not much, I admit, but still some). After a message is received from the MSMQ queue, it is first deserialized (most of the CPU/memory cost seems to happen at this point), then it goes through various processing/validation steps that cost some CPU, before finally reaching the "end" of the pipeline, where the processed message is sent to the outside world through HttpClient.
My implementation does not wait for a message to be fully processed before dequeuing the next one. My message pump, which dequeues messages, is very simple: it hands each message off immediately so that it can dequeue the next one. The simplified code is as follows (exception handling, cancellation, etc. omitted):
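while (true)
{
    var message = await this.queue.ReceiveNextMessageAsync();
    this.DeserializeDispatchMessageAsync(message);
}

private async void DeserializeDispatchMessageAsync(Message message)
{
    // Return to the dequeue loop immediately; the rest runs on the thread pool.
    await Task.Yield();

    // ... deserialize message.Body, run the processing steps, send through HttpClient ...
}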
ReceiveNextMessageAsync is a custom method built on TaskCompletionSource, since the .NET MessageQueue class does not offer any async method in .NET Framework 4.5. I simply wrap BeginReceive/EndReceive with a TaskCompletionSource.
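A sketch of that Begin/End-to-Task wrapping, under stated assumptions: MessageQueue lives in System.Messaging, which is .NET Framework only, so this self-contained version uses Stream.BeginRead/EndRead on a MemoryStream as a stand-in APM pair, and the helper name `ReadAsyncViaTcs` is hypothetical. It shows the same TaskCompletionSource pattern, not the author's actual ReceiveNextMessageAsync.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: wraps an APM Begin/End pair in a Task via TaskCompletionSource,
// the same pattern applied to MessageQueue.BeginReceive/EndReceive.
static Task<int> ReadAsyncViaTcs(Stream stream, byte[] buffer, int offset, int count)
{
    var tcs = new TaskCompletionSource<int>();
    try
    {
        stream.BeginRead(buffer, offset, count, asyncResult =>
        {
            // Runs when the operation completes; End* yields the result or throws.
            try { tcs.TrySetResult(stream.EndRead(asyncResult)); }
            catch (Exception ex) { tcs.TrySetException(ex); }
        }, null);
    }
    catch (Exception ex)
    {
        // Begin* can also fail synchronously, before any callback is scheduled.
        tcs.TrySetException(ex);
    }
    return tcs.Task;
}

var source = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var destination = new byte[3];
int read = await ReadAsyncViaTcs(source, destination, 0, destination.Length);
Console.WriteLine($"Read {read} bytes"); // prints "Read 3 bytes"
```

For APM pairs whose signatures fit its overloads, `Task<int>.Factory.FromAsync(stream.BeginRead, stream.EndRead, buffer, 0, count, null)` performs the same wrapping without a hand-written TaskCompletionSource.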
This is one of the only places in my code where I do not await an async method. The loop dequeues as fast as it can. It does not even wait for message deserialization (deserialization is performed lazily by the .NET FCL Message implementation when the Body property is accessed). I immediately call await Task.Yield() to fork the deserialization/processing of the message into another task and free the loop right away.
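A self-contained sketch of the fire-and-forget pump described above, under stated assumptions: a plain counter replaces the awaited MSMQ receive, `ProcessMessageAsync` is a hypothetical stand-in for DeserializeDispatchMessageAsync, the CPU work is simulated busy-work, and the method returns Task rather than async void so the sketch can wait for every message to finish. Written for a current .NET console project (top-level statements, `Random.Shared`).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

const int MessageCount = 1_000;
int processed = 0;
var inFlight = new List<Task>(MessageCount);

// Hypothetical stand-in for DeserializeDispatchMessageAsync (returns Task, not void).
async Task ProcessMessageAsync(int message)
{
    // With no SynchronizationContext, the rest of this method is queued to the
    // thread pool and control returns to the pump loop right away.
    await Task.Yield();

    // Simulated CPU-bound deserialization and processing.
    int checksum = message;
    for (int i = 0; i < 10_000; i++) checksum = unchecked(checksum * 31 + i);

    // Simulated HttpClient send: the only genuinely asynchronous wait (0-50 ms).
    await Task.Delay(Random.Shared.Next(0, 51));

    Interlocked.Increment(ref processed);
}

// Message pump: hand each message off and move on without awaiting its processing.
for (int id = 0; id < MessageCount; id++)
{
    inFlight.Add(ProcessMessageAsync(id));
}

// Only the sketch waits here, so it can report a result.
await Task.WhenAll(inFlight);
Console.WriteLine($"Processed {processed} of {MessageCount} messages");
```

Returning Task instead of async void is a design choice of the sketch: an exception escaping an async void method with no SynchronizationContext is rethrown on the thread pool and terminates the process, while a returned Task captures it for the caller to observe.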
Right now, in the context of my benchmarks, as I said, all I/O (here, only database access) is faked. All calls to async methods that retrieve data from the database simply return Task.FromResult with fake data. During message processing there are about 20 database calls, and all of them are currently faked and complete synchronously. The only asynchronous point is at the end of message processing, where the message is sent through HttpClient. The HttpClient send is also faked, but I do a random (0 to 50 ms) "await Task.Delay" at that point. In any case, because the database is faked, the processing of each message can be considered a single task.
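The "single task" observation follows from how await treats completed tasks. A minimal sketch, assuming hypothetical names `FakeDbCallAsync` and `ProcessOneMessageAsync` and a current .NET console project: awaiting an already-completed Task.FromResult continues synchronously on the same thread, so twenty faked database calls never hand work back to the scheduler, and only the final Task.Delay actually suspends the method.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical fake DB call, as in the benchmark: the task is already completed.
static Task<int> FakeDbCallAsync(int i) => Task.FromResult(i * 2);

// Hypothetical message processing: ~20 fake DB calls, then one real asynchronous wait.
static async Task<(int Sum, int StartThread, int AfterDbThread)> ProcessOneMessageAsync()
{
    int startThread = Environment.CurrentManagedThreadId;
    int sum = 0;
    for (int i = 0; i < 20; i++)
    {
        // Awaiting a completed task continues synchronously: no thread switch, no queuing.
        sum += await FakeDbCallAsync(i);
    }
    int afterDbThread = Environment.CurrentManagedThreadId;

    // Task.Delay(10) returns an incomplete task, so the method suspends here and
    // its continuation later runs on a thread-pool thread.
    await Task.Delay(10);
    return (sum, startThread, afterDbThread);
}

var result = await ProcessOneMessageAsync();
Console.WriteLine($"Sum={result.Sum}, same thread through the DB calls: {result.StartThread == result.AfterDbThread}");
// prints "Sum=380, same thread through the DB calls: True"
```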
For my benchmarks, I put about 300 thousand messages in the queue, and then I launch the server application. It dequeues this flood of messages pretty quickly, and all messages are processed concurrently. That's why I don't understand why I don't reach 100% CPU on all 4 cores: only about 75% and 3 cores are used (with the rest showing up as synchronization).
When I only dequeue, without deserializing or processing messages (by commenting out the DeserializeDispatchMessageAsync call), I achieve a throughput of about 20 thousand messages/sec. When I do all the processing, I achieve a throughput of about 10 thousand messages/sec.
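A back-of-the-envelope check of these rates against the roughly 300 thousand queued messages, assuming each rate holds for the whole run:

```latex
\begin{aligned}
\text{dequeue only:}\quad & \frac{300\,000\ \text{msg}}{20\,000\ \text{msg/s}} = 15\ \text{s},
  & \frac{1}{20\,000\ \text{msg/s}} &= 50\ \mu\text{s per message} \\
\text{full processing:}\quad & \frac{300\,000\ \text{msg}}{10\,000\ \text{msg/s}} = 30\ \text{s},
  & \frac{1}{10\,000\ \text{msg/s}} &= 100\ \mu\text{s per message}
\end{aligned}
```

So deserialization plus processing adds roughly 50 µs of wall-clock time per message on average.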
Because messages are dequeued quickly and deserialization + processing runs in a separate task, I picture many tasks (one per message) queued in the task scheduler (the thread pool here, since there is no synchronization context). So I would expect the thread pool to dispatch these tasks onto as many cores as possible and keep all 4 cores fully busy, but that doesn't seem to be the case.
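To test that expectation in isolation, a sketch separate from the server code (the workload is hypothetical busy-work; written for a current .NET console project) queues CPU-bound tasks with Task.Run and records how many distinct pool threads ran them and the peak number running at the same time. Comparing the peak with Environment.ProcessorCount gives a rough indication of whether the pool spreads pure CPU work across every logical core on a given machine.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

const int TaskCount = 64;
var threadsUsed = new ConcurrentDictionary<int, bool>();
int running = 0, peak = 0;

// Hypothetical CPU-bound work item standing in for per-message processing.
void Work()
{
    int now = Interlocked.Increment(ref running);

    // Lock-free "max" update: record the highest number of items running at once.
    int seen;
    while (now > (seen = Volatile.Read(ref peak)) &&
           Interlocked.CompareExchange(ref peak, now, seen) != seen) { }

    threadsUsed.TryAdd(Environment.CurrentManagedThreadId, true);

    double busy = 0;
    for (int i = 1; i < 1_000_000; i++) busy += Math.Sqrt(i); // pure CPU work

    Interlocked.Decrement(ref running);
}

var tasks = new List<Task>(TaskCount);
for (int t = 0; t < TaskCount; t++) tasks.Add(Task.Run(() => Work()));
await Task.WhenAll(tasks);

Console.WriteLine($"Logical processors: {Environment.ProcessorCount}");
Console.WriteLine($"Distinct pool threads used: {threadsUsed.Count}, peak concurrency: {peak}");
```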
In any case, any answer is welcome; I am looking for any ideas or advice.