I came across some strange behavior in one of my WCF services. The service worked fine for about 1.5 years, but for the last few weeks it has been showing some kind of “interruptions” (unfortunately I can’t post images yet, because I'm new here).
The calls per second drop to 0, although calls are still coming in. The “outage” always lasts 15 seconds; after these 15 seconds, the queued calls are processed. It cannot be network-related, because 90% of all calls come from another WCF service on the same server, and none of the other services (10 in total) are affected by this behavior. The service itself keeps working during the outage, e.g. calculating internal things, doing database updates, etc., and there is no increase in the time those internal tasks take. The outage happens roughly every 18-25 minutes, but it always lasts 15 seconds.
OS
Windows Server 2012
WCF runs as a Windows service
WCF Configuration:
InstanceContextMode = InstanceContextMode.PerCall,
ConcurrencyMode = ConcurrencyMode.Multiple,
UseSynchronizationContext = false,
IncludeExceptionDetailInFaults = true
Binding = WebHttpBinding
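Expressed in code, those behavior settings correspond to roughly the following (a minimal sketch; the contract and class names are placeholders, the real service is much larger):

using System.ServiceModel;
using System.ServiceModel.Web;

[ServiceContract]
public interface ICacheService
{
    // Placeholder operation; the real contract has many more.
    [OperationContract]
    [WebGet(UriTemplate = "ping")]
    string Ping();
}

[ServiceBehavior(
    InstanceContextMode = InstanceContextMode.PerCall,
    ConcurrencyMode = ConcurrencyMode.Multiple,
    UseSynchronizationContext = false,
    IncludeExceptionDetailInFaults = true)]
public class CacheService : ICacheService
{
    public string Ping() { return "pong"; }
}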
Concurrency Throttle Settings:
MaxConcurrentCalls = 384,
MaxConcurrentInstances = 2784,
MaxConcurrentSessions = 2400
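The throttle values are applied when the host is opened, roughly like this (again a sketch; the base address and the way the host is created inside the Windows service differ):

using System;
using System.ServiceModel.Description;
using System.ServiceModel.Web;

static class HostStartup
{
    public static WebServiceHost StartHost()
    {
        // Placeholder base address; the real endpoint is different.
        var host = new WebServiceHost(typeof(CacheService), new Uri("http://localhost:8080/cache"));

        var throttle = host.Description.Behaviors.Find<ServiceThrottlingBehavior>();
        if (throttle == null)
        {
            throttle = new ServiceThrottlingBehavior();
            host.Description.Behaviors.Add(throttle);
        }
        throttle.MaxConcurrentCalls = 384;
        throttle.MaxConcurrentInstances = 2784;
        throttle.MaxConcurrentSessions = 2400;

        host.Open();
        return host;
    }
}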
What I have already investigated:
- Concurrency throttle
I took a full dump of the service at the exact moment this happened. Neither ConcurrentCalls nor ConcurrentSessions were anywhere near exhausted, and the dump did not show any exceptions that could be causing the problem.
- Max TCP connections
Monitoring shows that the number of active TCP connections is far below the limit (see also the monitoring sketch after this list).
- Channel trunking in the switch
Since no calls get through at all, not even from local services (via localhost), I'm sure this is not network-related.
- Load problem
The outages occur at low load (see below) as well as at high load (5 times more incoming calls), and their frequency does not change with the load. I also tried to reproduce the behavior on my staging system at about 600-1000 calls per second. I managed to push the service to a point where more calls per second came in than it could serve: outstanding calls piled up and at some point the service collapsed, of course. But this 15-second behavior never appeared.
- Thread Pool Exhaustion
The problem occurs both when the service is running with 50 threads and when it is running with 200 threads. Besides, if the available threads were exhausted, there would be an error message. (A monitoring sketch for thread pool and TCP connections follows after this list.)
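For completeness, this is roughly how the thread pool headroom and the machine-wide TCP connection count can be sampled while the outage is happening (a minimal sketch; the real logging is more elaborate):

using System;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;

static class HealthProbe
{
    public static void LogOnce()
    {
        int availableWorker, availableIocp, maxWorker, maxIocp;
        ThreadPool.GetAvailableThreads(out availableWorker, out availableIocp);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIocp);

        // Count only fully established TCP connections on the whole machine.
        int established = IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpConnections()
            .Count(c => c.State == TcpState.Established);

        Console.WriteLine(
            "Worker threads in use: {0}/{1}, IOCP threads in use: {2}/{3}, established TCP connections: {4}",
            maxWorker - availableWorker, maxWorker,
            maxIocp - availableIocp, maxIocp,
            established);
    }
}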
I'm running out of possible causes for this behavior. My current suspicion is a GC pause blocking the threads, since the service holds about 10 GB in RAM; it is essentially an in-memory cache service. Or it may be the OS (Windows Server 2012) or something related to running as a Windows service itself.
Has anyone come across something similar, or does anyone have another idea what might cause this?
Edit: Now I can post images:

Edit: GC heap dump (thanks usr)

I can see that almost 50% of the heap (about 70% including referenced objects) is held by one large dictionary with approx. 27 million entries. I will focus on reorganizing it, since it contains a lot of unused entries. Perhaps this will help.
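For illustration, the “reorganization” amounts to rebuilding the dictionary with only the live entries, so the oversized backing arrays of the old instance become collectible (a sketch with a hypothetical entry type and staleness rule; the real cache code is different):

using System;
using System.Collections.Generic;

// Hypothetical entry type standing in for the real cache entry class.
class CacheEntry
{
    public DateTime LastAccess;
}

static class CacheMaintenance
{
    // Copy only entries touched recently into a fresh dictionary and let the
    // old, mostly-unused one be garbage collected.
    public static Dictionary<string, CacheEntry> Prune(
        Dictionary<string, CacheEntry> cache, TimeSpan maxAge)
    {
        var cutoff = DateTime.UtcNow - maxAge;
        var rebuilt = new Dictionary<string, CacheEntry>();

        foreach (var kvp in cache)
        {
            if (kvp.Value.LastAccess >= cutoff)
                rebuilt.Add(kvp.Key, kvp.Value);
        }
        return rebuilt;
    }
}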
In addition, I will hook up the GC.WaitForFullGCApproach method from MSDN to check whether a full GC is running while the service stops processing incoming requests.
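A sketch of what that notification hook can look like (note: GC.RegisterForFullGCNotification only works when concurrent GC is disabled, so this is for a test configuration only):

using System;
using System.Threading;

static class GcWatcher
{
    public static void Start()
    {
        // Works only with concurrent GC disabled (<gcConcurrent enabled="false"/>),
        // otherwise this call throws InvalidOperationException.
        GC.RegisterForFullGCNotification(10, 10);

        var watcher = new Thread(() =>
        {
            while (true)
            {
                if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                    Console.WriteLine("{0:O} full GC approaching", DateTime.UtcNow);

                if (GC.WaitForFullGCComplete() == GCNotificationStatus.Succeeded)
                    Console.WriteLine("{0:O} full GC completed", DateTime.UtcNow);
            }
        });
        watcher.IsBackground = true;
        watcher.Start();
    }
}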
I will keep you posted when I find out more.
Edit: GC statistics (14-second outage)
•CLR Startup Flags: CONCURRENT_GC
•Total CPU Time: 42.662 msec
•Total GC CPU Time: 2.748 msec
•Total Allocs: 1.524,637 MB
•MSec/MB Alloc: 1,802 msec/MB
•Total GC Pause: 2.977,2 msec
•% Time paused for Garbage Collection: 19,4%
•% CPU Time spent Garbage Collecting: 6,4%
•Max GC Heap Size: 11.610,333 MB
•Peak Process Working Set: 14.917,915 MB
•Peak Virtual Memory Usage: 15.326,974 MB
That is “only” about 3 seconds of total GC pause. It should not be that high in any case, and I am going to reorganize the in-memory storage, but it does not explain the 15 seconds :(
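As a side note, the CONCURRENT_GC flag above indicates the process runs with workstation concurrent GC rather than server GC (server GC would be enabled via <gcServer enabled="true"/> in the app.config). The effective mode can be checked at runtime like this (a minimal sketch):

using System;
using System.Runtime;

static class GcInfo
{
    public static void Log()
    {
        // GCSettings reports the mode the runtime actually started with.
        Console.WriteLine("Server GC: {0}, Latency mode: {1}, Max generation: {2}",
            GCSettings.IsServerGC, GCSettings.LatencyMode, GC.MaxGeneration);
    }
}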
Edit: Over the weekend, I did the following:
Installed the latest Windows updates (the last update was 2 months ago)
Rebooted the Windows server
Reworked the in-memory storage of the 27 million objects. I managed to reduce the used memory from 11 GB to 6-8 GB (which is quite a lot). Very old code there ;)
The problem has not recurred so far (about 17 hours of uptime now). This makes me suspect that either the GC was suspending the service or something OS-related was causing this behavior.
I assume the problem has not really been "solved" at all, and that it will come back at some point as the amount of data grows over time.
Thanks to everyone who spent time on this. I will keep investigating the dumps and try to find out in detail what happened. I'll keep you up to date.