Monitor.TryEnter and Threading.Timer race condition

I have a Windows service that checks for work every 5 seconds. It uses a System.Threading.Timer to drive the validation and processing, and Monitor.TryEnter to ensure that only one thread validates at a time.

Supposedly it has to be this way, since the following code is part of 8 other workers that the service creates, and each worker has its own particular type of work it needs to check for.

    readonly object _workCheckLocker = new object();

    public Timer PollingTimer { get; private set; }

    void InitializeTimer()
    {
        if (PollingTimer == null)
            PollingTimer = new Timer(PollingTimerCallback, null, 0, 5000);
        else
            PollingTimer.Change(0, 5000);
        Details.TimerIsRunning = true;
    }

    void PollingTimerCallback(object state)
    {
        if (!Details.StillGettingWork)
        {
            if (Monitor.TryEnter(_workCheckLocker, 500))
            {
                try
                {
                    CheckForWork();
                }
                catch (Exception ex)
                {
                    Log.Error(EnvironmentName + " -- CheckForWork failed. " + ex);
                }
                finally
                {
                    Monitor.Exit(_workCheckLocker);
                    Details.StillGettingWork = false;
                }
            }
        }
        else
        {
            Log.Standard("Continuing to get work.");
        }
    }

    void CheckForWork()
    {
        Details.StillGettingWork = true;
        //Hit web server to grab work.
        //Log Processing
        //Process Work
    }
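As an aside on the code above: `Details.StillGettingWork` is read outside the lock and reset in the `finally` block, so by itself it cannot reliably keep a second callback out. Below is a minimal sketch of a truly atomic busy flag using `Interlocked.CompareExchange` that closes that window; the `WorkGuard` type and its members are mine for illustration, not part of the original code.

```csharp
using System;
using System.Threading;

class WorkGuard
{
    // 0 = idle, 1 = busy. CompareExchange makes the check-and-set atomic,
    // unlike reading a plain bool flag and then taking a lock in two steps.
    private int _busy;

    public bool TryRunOnce(Action work)
    {
        // Claim the flag; if another thread already holds it, skip this cycle.
        if (Interlocked.CompareExchange(ref _busy, 1, 0) != 0)
            return false;
        try
        {
            work();
            return true;
        }
        finally
        {
            Interlocked.Exchange(ref _busy, 0);
        }
    }
}
```

Note this only guards threads inside one AppDomain, so it would not help if the duplicate callers live in different AppDomains or processes.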

Now here is the problem:
The above code sometimes allows two timer threads to enter the CheckForWork() method. I honestly don't understand how this is possible, but I have seen it happen with several clients running this software.

The logs I received today, when I dug into some work, showed that it checked for work twice, and I had 2 threads trying to process it independently, which in turn caused the work to fail.

    Processing 0-3978DF84-EB3E-47F4-8E78-E41E3BD0880E.xml for Update Request. - at 09/14 10:15:501255801
    Stopping environments for Update request - at 09/14 10:15:501255801
    Processing 0-3978DF84-EB3E-47F4-8E78-E41E3BD0880E.xml for Update Request. - at 09/14 10:15:501255801
    Unloaded AppDomain - at 09/14 10:15:501255801
    Stopping environments for Update request - at 09/14 10:15:501255801
    AppDomain is already unloaded - at 09/14 10:15:501255801
    === Starting Update Process === - at 09/14 10:15:513756009
    Downloading File X - at 09/14 10:15:525631183
    Downloading File Y - at 09/14 10:15:525631183
    === Starting Update Process === - at 09/14 10:15:525787359
    Downloading File X - at 09/14 10:15:525787359
    Downloading File Y - at 09/14 10:15:525787359

Logs are written asynchronously and queued, so don't read too much into the timestamps matching exactly; I just want to show what I saw in the logs, namely that 2 threads got into a code section that, as far as I understood it, should never have allowed that. (The log entries and times are real though, just with sanitized messages.)

In the end, what happens is that the 2 threads start downloading a fairly large file, each tries to access the same file, the download fails, and that causes the entire update to fail.

How can this happen with this code? I ran into this problem last year when I had a lock instead of Monitor, and I assumed it was only because the timer callbacks eventually drifted enough under lock contention that I was stacking up timer threads, i.e. one would hold the lock for nearly 5 seconds and release it right as the timer fired another callback, and somehow both got in. That's why I switched to Monitor.TryEnter, so I wouldn't just keep piling up blocked timer threads.

Any clue? Every time I have tried to troubleshoot this in the past, System.Threading.Timer was the only constant, and I suspect it is the root cause, but I don't understand why.
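For what it's worth, one common way to rule out overlapping Timer callbacks entirely, wherever the extra thread comes from, is to arm the timer as one-shot and only re-arm it after the callback finishes. A minimal sketch of that pattern, assuming a worker shape like the one in the question (the `Poller` type and its members are illustrative, not from the code above):

```csharp
using System;
using System.Threading;

class Poller
{
    private readonly Timer _timer;
    private readonly TimeSpan _interval;
    private readonly Action _work;

    public Poller(Action work, TimeSpan interval)
    {
        _work = work;
        _interval = interval;
        // Due time only, no period: the timer fires once and stays quiet
        // until we explicitly re-arm it, so callbacks can never overlap.
        _timer = new Timer(Callback, null, _interval, Timeout.InfiniteTimeSpan);
    }

    private void Callback(object state)
    {
        try
        {
            _work();
        }
        finally
        {
            // Re-arm only after the work is done (successfully or not).
            _timer.Change(_interval, Timeout.InfiniteTimeSpan);
        }
    }
}
```

With this shape the lock and the StillGettingWork flag become unnecessary for mutual exclusion within one instance, though the interval becomes "5 seconds after the previous check finished" rather than "every 5 seconds on the clock".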

2 answers

TL;DR
The production stored procedure had not been updated in years. Workers were handed work they should never have received, and so multiple workers processed the same update requests.


I was finally able to find the time to properly set myself up locally to act as a production client through Visual Studio. Although I could not reproduce it the way I had experienced it, I accidentally stumbled onto the problem.

Those who guessed that multiple workers were picking up the same job were indeed correct, even though that was supposed to be impossible, since each worker is unique in its work type and its query.

It turns out that in our production environment, the stored procedure that retrieves work based on work type had not been updated across years (yes, years!) of deployments. Anything that checked for work automatically received update work, meaning that when the Update worker and the Foo worker checked at the same time, both received the same update request.

Fortunately, the fix is on the database side, not a client update.


I see in the log you posted that you have an AppDomain restart, is that correct? If so, are you sure that you have one and only one instance of your service object across the AppDomain restart? My thinking is that during the restart not all threads stop at exactly the same moment, and some of them can continue to poll the work queue, so two different threads in different AppDomains got the same Id for the work.

You could probably fix this by marking _workCheckLocker as static, like so:

 static object _workCheckLocker; 

and introduce a static constructor for your class to initialize this field (with inline initialization you may run into subtler problems). But I'm not sure this will be enough for your case: during an AppDomain restart, static state is reinitialized as well. As far as I understand, that makes this a non-option for you.

Perhaps you could introduce a static dictionary instead of a plain object for your workers, so you can check the Ids of the documents currently being processed.
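That idea could look something like the following sketch (the `InFlightWork` type and its members are mine, for illustration): a static `ConcurrentDictionary` keyed by the work item's Id, where the atomic `TryAdd` claims an Id so a second worker that sees the same document simply backs off.

```csharp
using System.Collections.Concurrent;

static class InFlightWork
{
    // Keyed by work-item Id; the byte value is a dummy, the key set is what matters.
    private static readonly ConcurrentDictionary<string, byte> _inFlight =
        new ConcurrentDictionary<string, byte>();

    // TryAdd is atomic: only one caller can claim a given Id at a time.
    public static bool TryClaim(string workId) => _inFlight.TryAdd(workId, 0);

    // Call when processing finishes (or fails) so the Id can be picked up again.
    public static void Release(string workId) => _inFlight.TryRemove(workId, out _);
}
```

Like any static state, though, this also resets on an AppDomain restart, so it helps with duplicate threads within one domain, not across domains.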

Another approach is to handle the Stopping event for your service, which presumably is raised during the AppDomain restart; in the handler you cancel a CancellationToken and use it to stop all work in such circumstances.
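A sketch of that cancellation idea, with an illustrative `Worker` shape (the names here are mine): the Stopping handler cancels a token that the polling code checks between steps.

```csharp
using System;
using System.Threading;

class Worker
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();

    // Wire this up to the service's Stopping event / AppDomain teardown.
    public void Stop() => _cts.Cancel();

    public void CheckForWork()
    {
        // Bail out between steps once shutdown has been requested,
        // so a half-stopped AppDomain can't keep pulling work.
        _cts.Token.ThrowIfCancellationRequested();
        // ... hit the server for work ...
        _cts.Token.ThrowIfCancellationRequested();
        // ... process the work ...
    }
}
```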

Also, as @fernando.reyes said, you could introduce a heavier OS-level lock, a named Mutex, for synchronization, but that will degrade your performance.

