Service Broker & AlwaysOn Availability Groups: Odd Transmission Queue Behavior

I’ve been working on a project over the past several months that will use Service Broker and AlwaysOn Availability Groups to meet some of my company’s HA and DR goals (more info here). Recently, I implemented the full solution in my development lab and pointed an instance of our website at it. While we were working out some kinks to get the database and website working well with my Service Broker Replication project, I noticed some odd behavior in Service Broker when it’s used with AlwaysOn Availability Groups, and I wanted to blog about it in the hope that someone else has seen this issue and might have an idea how to address it.

Update: I’ve also posted this question on

The Setup:

I have a Hyper-V host running 6 Windows Server 2008 R2 VMs (BTDevSQLVM1-BTDevSQLVM6). The VMs are grouped into three 2-node WSFCs with Node and File Share Majority quorum. I’ve installed a standalone SQL Server 2012 Developer Edition instance on each VM and created an Availability Group with a listener on each cluster (SBReplDistrib, SBRepl1, & SBRepl2).

For the purpose of this blog post, I’ll be focusing on the communication between SBRepl1 and SBReplDistrib. The illustration below shows the Service Broker objects for each side of the conversation:

The Service Broker endpoints and routes are set up per this MSDN article. The SBRepl_Receive route in msdb is for the local server’s service (//SBReplDistrib/SBRepl on SBReplDistrib, and //SBRepl1/SBRepl on SBRepl1) and points to the local instance. The SBRepl_Send route on SBRepl1 maps service //SBReplDistrib/SBRepl to TCP://SBReplDistrib:4022, and the SBRepl_Send_SBRepl1 route on SBReplDistrib is a similar mapping for the service on SBRepl1.
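As a rough sketch of what that routing looks like on SBRepl1 (the object names match my setup; the port and exact options follow the MSDN pattern, and your endpoint port may differ):

```sql
-- On SBRepl1, in the user database: send traffic for the distributor's
-- service through the AG listener name rather than a node name.
CREATE ROUTE SBRepl_Send
    WITH SERVICE_NAME = '//SBReplDistrib/SBRepl',
         ADDRESS      = 'TCP://SBReplDistrib:4022';

-- On SBRepl1, in msdb: incoming messages for the local service are
-- routed to the local instance so the classifier can deliver them.
CREATE ROUTE SBRepl_Receive
    WITH SERVICE_NAME = '//SBRepl1/SBRepl',
         ADDRESS      = 'LOCAL';
```

SBReplDistrib has the mirror image of this: an SBRepl_Send_SBRepl1 route pointing at TCP://SBRepl1:4022 and its own LOCAL route in msdb.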

The Expected Behavior:

My understanding of how Service Broker handles message sending and receiving is as follows (this is pretty simplified; there is a lot more detail about the process in Klaus Aschenbrenner’s (blog | twitter) book “Pro SQL Server 2008 Service Broker”):

  1. The initiator app creates a message (in this case, well formed XML)
  2. If there is an existing dialog conversation between the initiator service and the target service that is in the conversing status, the app can simply send the message on the existing conversation handle. Otherwise, the initiator app should begin a dialog conversation between the initiator service and the target service and send the message on that conversation handle.
  3. The message is placed in the sys.transmission_queue system table and Service Broker begins making attempts to deliver the message to the target service.
  4. Service Broker looks for an appropriate route and remote service binding and uses them to determine the address to connect to in order to deliver the message.
  5. Service Broker opens a connection to the target, authenticates, and delivers the message to the target service broker.
  6. The target Service Broker attempts to classify the message and determine what local service will handle the message (it uses route data in the msdb database for this).
  7. The target Service Broker delivers the message to the target service’s queue
  8. Once the message is successfully delivered to the target queue, the target Service Broker looks for route information back to the initiator and attempts to deliver an acknowledgement that the message was received.
  9. The initiator’s Service Broker receives the acknowledgement and uses routing information in MSDB to determine what local service the acknowledgement is for.
  10. Upon successful routing of the acknowledgement to the initiating service, the message is then removed from the sys.transmission_queue system table.
  11. If the initiator does not receive an acknowledgement that the message was received, it will periodically retry delivering the message to the target. If the target has already received the message, it will simply drop any additional delivery retries and send acknowledgements for them.
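Steps 1 and 2 on the initiator side can be sketched like this (the contract and message type names here are hypothetical stand-ins; the service names are from my setup, and the payload is illustrative):

```sql
-- Hypothetical initiator-side send from SBRepl1 to the distributor.
DECLARE @handle UNIQUEIDENTIFIER,
        @msg    XML = N'<Change table="dbo.Example" op="I" />';

-- Begin a new dialog (or reuse an existing CONVERSING handle instead).
BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//SBRepl1/SBRepl]
    TO SERVICE   '//SBReplDistrib/SBRepl'
    ON CONTRACT  [//SBRepl/Contract]      -- hypothetical contract name
    WITH ENCRYPTION = OFF;

-- The message lands in sys.transmission_queue until it is delivered
-- to the target and the acknowledgement comes back (steps 3-10).
SEND ON CONVERSATION @handle
    MESSAGE TYPE [//SBRepl/Message]       -- hypothetical message type
    (@msg);
```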

The Odd Behavior:

Step 11 is where I am seeing some very odd behavior with Service Broker and AlwaysOn. I see the message getting delivered to the target and processed successfully, and I also see the acknowledgement being sent back to the initiator and received. However, the message remains in sys.transmission_queue as though no acknowledgement was received. Stranger still, Service Broker isn’t attempting to resend the message as I would expect if the acknowledgement hadn’t been received. Instead, the message simply remains in sys.transmission_queue, and as new messages are sent, they too get delivered and acknowledged, and they too remain in sys.transmission_queue. It seems as though Service Broker is receiving the acknowledgements and therefore stops trying to deliver the messages, but for some reason doesn’t remove them from sys.transmission_queue. The transmission_status for these messages remains blank, which should indicate that Service Broker hasn’t even attempted to deliver them yet.

I checked the retention setting on the service queue, and it is set to off, but retention should only affect the service queue, not sys.transmission_queue. I have also traced both sides of the conversation with SQL Profiler, and I can see the message being sent and the acknowledgement being sent back to the initiator and received (see the XML trace data at the end of this post).
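For anyone who wants to check the same things on their own instance, these are the diagnostic queries I’ve been using (standard catalog views, nothing specific to my setup):

```sql
-- Stuck messages: a blank transmission_status normally means no
-- delivery attempt has been made (or the last attempt succeeded),
-- which is what makes the behavior above so odd.
SELECT conversation_handle,
       to_service_name,
       enqueue_time,
       transmission_status
FROM sys.transmission_queue;

-- Confirm retention is off for the service queue; retention only
-- affects the service queue, not sys.transmission_queue.
SELECT name, is_retention_enabled
FROM sys.service_queues;
```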

One odd thing did jump out at me in the traces, though. Both sides seem to be a bit confused about the TCP connections: messages are sent from the IP address of the node itself, while the service routes and the messages themselves point to the name/IP of the AG listener. This confusion appears to cause each side to close the existing connection between the two services and open a new one in order to deliver a message or acknowledgement. I’m not sure whether this is normal, or whether it has anything to do with the acknowledgements not being handled correctly, but it was the only thing I could see that might explain the odd behavior.

The Plea for Help:

At this time, I don’t have a solution to this message retention issue other than manually ending the conversations with cleanup on both sides, and that’s not really something I want to do. If you have any ideas as to why this might be happening or what I can do about it, please leave me a comment and let me know. If there is any additional information you would like me to provide about my setup or about the issue, please let me know in the comments as well. I will post a followup to this post if/when I find a solution.
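For completeness, this is the manual cleanup I’m referring to; a sketch only, and note that WITH CLEANUP drops the conversation on one side without notifying the other, which is exactly why I’d rather not rely on it:

```sql
-- Last-resort cleanup, run per stuck conversation on EACH side:
-- forcibly remove the conversation and its transmission_queue rows
-- without waiting for the far end.
DECLARE @handle UNIQUEIDENTIFIER;

SELECT TOP (1) @handle = conversation_handle
FROM sys.transmission_queue;

END CONVERSATION @handle WITH CLEANUP;
```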

The Trace Data:

I’ve removed the trace data from this blog post for brevity and security reasons. Please leave me a comment if you would like to review the trace files.

9 thoughts on “Service Broker & AlwaysOn Availability Groups: Odd Transmission Queue Behavior”

  1. Is this fixed, or is the issue still there? I am configuring Service Broker on AlwaysOn, but if the Availability Group fails over to the other replica, what happens to the messages queued for processing on the primary?
    Are they rolled back, deleted, or reprocessed once it fails over to the replica?

    • This should be resolved as of SP1, but I no longer work at the company where I had this issue, so I haven’t been able to test/verify it.

      Service Broker does use transactions, so if it’s processing a message and the Availability Group fails over, the transaction will roll back and the message will go back on the queue to be processed by the new primary. As for sending messages, if by some off chance a message is sent more than once, the duplicate will be ignored as long as the message’s ID is the same (an internal ID assigned by Service Broker itself).

  2. Pingback: Service Broker & AlwaysOn Availability Groups: Odd Transmission Queue Behavior | XL-UAT

  3. Pingback: Availability Groups / Service Broker Transmission Queue Bug « Paul Brewer

  4. For the benefit of others reading your blog, the current solution is:
    1. Remove the database from the AG
    2. alter database set disable_broker with rollback immediate
    3. alter database set enable_broker with rollback immediate

  5. Just a quick update: I submitted this issue to MS Product Support, and they have found 2 bugs with Service Broker in SQL 2012. Due to the complexity of the bugs and the need for them to setup a new testing lab for this issue, they won’t be fixing these bugs until the next service pack for SQL 2012. I will post another update once the service pack is released and I have tested it out.

    • From Microsoft:

      Problem: There are two defects that we identified during the case.
      1) With AlwaysOn and SSB, the ACK makes it back to the initiator but the messages are not removed from the sys.transmission_queue.
      2) With AlwaysOn and SSB in a multi-subnet configuration it can take up to 3 minutes for a new connection to send messages.


      VSTS: 962792 (Problem 1) — When AlwaysOn cannot determine the primary and switches states, it shuts down the ActiveServiceBroker for the database. The ASB has a class which calls a function that removes the messages from the sys.transmission_queue. Because there is not an instance of this class, messages are not removed from the sys.transmission_queue. When AlwaysOn subsequently determines the primary, there is no code to check for and restart the ASB.

      VSTS: 966329 (Problem 2) — When AlwaysOn and SSB are configured in a multi-subnet configuration, it is possible that a new connection can take 3 minutes to deliver a message. SSB uses the AG listener, along with routes, to determine where to send the message. We see the IP address of the replica returned first, then it times out and tries the primary and is successful. It does this for sending a message and also for the ACK. This means a total round-trip time of up to 3 minutes because of the timeout period. If a connection to the primary already exists, it is fast.

      Resolution: SSB development will address the fixes in a future service pack. Right now we can only reproduce the issue in VMs with multiple subnets by pausing and resuming the Windows cluster resource. The SSB development labs are not configured for multiple subnets, so they are not comfortable releasing fixes in a CU based on the current manual test; because of the complexity of the fix, they want to wait until the labs are configured with multiple subnets and can do full automated testing for a service pack. So the plan is to code the fixes in a future service pack to get wider test coverage.
