MSMQ messages bound to a cluster instance of MSMQ are stuck in outbound queues

We have clustered MSMQ for a set of NServiceBus services, and everything runs fine until it doesn't. Outgoing queues on one server begin to fill up, and pretty soon the whole system hangs.

More details:

We have a clustered MSMQ between servers N1 and N2. The only other clustered resources are services that operate directly on the clustered queues as local queues, namely the NServiceBus distributors.

All of the worker processes live on separate servers, Services3 and Services4.

For those unfamiliar with NServiceBus, work goes into a clustered work queue managed by the distributor. Worker applications on Services3 and Services4 send "I'm ready to work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.
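For readers who want that flow spelled out, here is a rough in-memory sketch of the pattern. It is not NServiceBus code: the `Distributor` class and its method names are hypothetical, and real queues are replaced by plain Python deques. It only illustrates that work is forwarded when, and only when, a "ready" message is waiting.

```python
from collections import deque


class Distributor:
    """Toy model of the distributor pattern (hypothetical names, no MSMQ involved)."""

    def __init__(self):
        self.work_queue = deque()     # pending units of work
        self.control_queue = deque()  # "I'm ready to work" messages (worker names)
        self.delivered = []           # (worker, work) pairs sent to worker input queues

    def enqueue_work(self, work):
        self.work_queue.append(work)
        self._dispatch()

    def worker_ready(self, worker):
        self.control_queue.append(worker)
        self._dispatch()

    def _dispatch(self):
        # Work is only forwarded while a ready worker is known.
        while self.work_queue and self.control_queue:
            worker = self.control_queue.popleft()
            work = self.work_queue.popleft()
            self.delivered.append((worker, work))


d = Distributor()
d.enqueue_work("job-1")      # no worker is ready yet, so the job waits
d.worker_ready("Services3")  # job-1 is handed to Services3
d.worker_ready("Services4")
d.enqueue_work("job-2")      # job-2 is handed to Services4
d.enqueue_work("job-3")      # stays queued until another ready message arrives
```

In this model, if ready messages stop arriving (for example because they are stuck in an outgoing queue), `work_queue` simply keeps growing, which matches the kind of hang described here.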

At some point, this process can completely hang. Below is an image of outgoing queues in a cluster instance of MSMQ when the system hangs:

Clustered MSMQ Outgoing Queues in Hung State

If I fail the cluster over to the other node, it is as if the whole system gets a kick in the pants. The following is an image of the same cluster instance of MSMQ shortly after failover:

Clustered MSMQ Outgoing Queues After Failover

Can someone explain this behavior, and what can I do to avoid it so that the system keeps running smoothly?

+7
cluster-computing message-queue msmq msdtc nservicebus
3 answers

More than a year later, it seems that our problem has been resolved. The key takeaways appear to be:

  • Make sure you have a reliable DNS system, so when MSMQ needs to resolve the host, it can.
  • Create only one clustered instance of MSMQ in a Windows failover cluster.
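As a quick sanity check for the first point, something like the following could be run from each server to see whether the other machines' names resolve. This is only a sketch: the function name is made up for illustration, and it checks ordinary name resolution through Python's `socket` module, which is not necessarily the exact lookup path MSMQ uses.

```python
import socket


def unresolved_hosts(hosts, resolve=socket.gethostbyname):
    """Return the names in `hosts` that fail to resolve.

    `resolve` is injectable so the function can be exercised without real DNS.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

For example, `unresolved_hosts(["N1", "N2", "Services3", "Services4"])` would return an empty list when every name resolves from the machine it is run on.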

When we set up our Windows failover cluster, we assumed that it would be bad to "waste" resources on an inactive node. Since we had two quasi-related NServiceBus clusters at the time, we created one clustered MSMQ instance for Project1 and another clustered MSMQ instance for Project2. We figured that most of the time we would run them on separate nodes, and that during maintenance windows they would share a node. After all, this was the setup we used for our primary and dev instances of SQL Server 2008, and it worked very well.

At some point I began to doubt this approach, especially since failing over each MSMQ instance once or twice always seemed to get messages moving again.

I asked Udi Dahan (author of NServiceBus) about this cluster hosting strategy, and he gave me a puzzled look and asked: "Why would you want to do something like that?" In fact, the Distributor is very lightweight, so there is really no reason to spread distributors evenly across the available nodes.

After that, we decided to take everything we had learned and build a new failover cluster with only one MSMQ instance. We have not seen the problem since. Of course, confirming that the problem is solved would mean proving a negative, which is impossible. It has not been an issue for at least 6 months, but who knows, I suppose it could fail tomorrow! Let's hope not.

+2

Your servers may have been cloned and thus have the same queue manager identifier (QMId).

MSMQ uses QMId as a hash to cache the addresses of remote computers. If more than one machine has the same QMId on your network, you may receive stuck or missing messages.

Check out the explanations and solutions on this blog: http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx
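If you have already collected each machine's QMId by hand (the linked post explains where to find it), a small helper like this can flag clashes. It is a sketch only: the function name is hypothetical, and the GUIDs in the example are invented placeholders, not values from any real server.

```python
from collections import defaultdict


def find_duplicate_qmids(qmid_by_host):
    """Group hosts by QMId and return only the QMIds shared by more than one host."""
    groups = defaultdict(list)
    for host, qmid in qmid_by_host.items():
        # Normalize so that case or stray whitespace does not hide a clash.
        groups[qmid.strip().lower()].append(host)
    return {qmid: sorted(hosts) for qmid, hosts in groups.items() if len(hosts) > 1}


# Invented sample data: Services3 and Services4 were "cloned" in this example.
sample = {
    "N1": "11111111-1111-1111-1111-111111111111",
    "Services3": "AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAAAAAA",
    "Services4": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
}
duplicates = find_duplicate_qmids(sample)
```

Any entry in `duplicates` names a set of machines that would need the fix described in the blog post.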

+2

How are your endpoints configured to persist their subscriptions?

What happens if one (or more) of your services hits an error and is restarted by the Failover Cluster Manager? In that case, the restarted service would never receive the "I'm ready to work" messages that the other services had already sent.

When you fail over to the other node, I think all of your services send these messages again, and as a result everything starts working again.

To verify this behavior, follow these steps:

  • Stop and restart all your services.
  • Stop only one of the services.
  • Restart the stopped service.
  • If your system does not freeze, repeat this with each individual service.

If your system freezes again, check your configuration. In that case at least one of your services, if not all of them, loses its subscriptions between restarts. If you have not already done so, persist the subscriptions in a database.

+1
