I’ve been working on a project over the past several months that will use Service Broker and AlwaysOn Availability Groups to meet some of the HA and DR goals of the company I work for (more info here). Just recently, I was able to implement the full solution in my development lab and point an instance of our website at it. While we were working out some kinks in our database and website to get the two working well with my Service Broker Replication project, I began noticing some odd behavior in Service Broker when it’s used with AlwaysOn Availability Groups. I wanted to blog about it to see whether anyone else has seen this issue and might have an idea how to address it.
Update: I’ve also posted this question on DBA.StackExchange.com.
I have a Hyper-V host running 6 Windows Server 2008 R2 VMs (BTDevSQLVM1-BTDevSQLVM6). The VMs are grouped into 2-node WSFCs using Node and File Share Majority quorum. I’ve installed standalone SQL Server 2012 Developer Edition instances on each of the VMs, and created an Availability Group with a listener on each cluster (SBReplDistrib, SBRepl1, & SBRepl2).
For the purpose of this blog post, I’ll be focusing on the communication between SBRepl1 and SBReplDistrib. The illustration below shows the Service Broker objects for each side of the conversation:
The Service Broker endpoints and routes are set up per this MSDN article. The SBRepl_Receive route in msdb is for the local server’s service (//SBReplDistrib/SBRepl on SBReplDistrib, and //SBRepl1/SBRepl on SBRepl1), and points to the local instance. The SBRepl_Send route on SBRepl1 maps service //SBReplDistrib/SBRepl to TCP://SBReplDistrib:4022, and the SBRepl_Send_SBRepl1 route on SBReplDistrib is a similar mapping for the service on SBRepl1.
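As a rough illustration, the two routes on SBRepl1 might be created along these lines. This is a minimal sketch based only on the description above, not the actual deployment script: the database name SBReplDemo is a hypothetical placeholder, and ADDRESS = 'LOCAL' is an assumption about how the receive route points to the local instance.

```sql
-- Sketch only: run on the SBRepl1 primary replica.
-- [SBReplDemo] is a hypothetical name for the user database that hosts the service.
USE [SBReplDemo];
GO
-- Outbound route: messages for the distributor's service go to its AG listener.
CREATE ROUTE [SBRepl_Send]
WITH SERVICE_NAME = N'//SBReplDistrib/SBRepl',
     ADDRESS      = N'TCP://SBReplDistrib:4022';
GO

USE [msdb];
GO
-- Inbound route: messages arriving from the network are classified using
-- routes in msdb. 'LOCAL' (assumed here) keeps delivery on this instance.
CREATE ROUTE [SBRepl_Receive]
WITH SERVICE_NAME = N'//SBRepl1/SBRepl',
     ADDRESS      = N'LOCAL';
GO
```

The routes on SBReplDistrib would mirror these, with the service names and listener address swapped.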
The Expected Behavior:
My understanding of how Service Broker sends and receives messages is as follows. (This is pretty simplified; there is a lot more detail about this process in Klaus Aschenbrenner’s (blog | twitter) book “Pro SQL Server 2008 Service Broker”.)
- The initiator app creates a message (in this case, well-formed XML).
- If there is an existing dialog conversation between the initiator service and the target service that is in the conversing status, the app can simply send the message on the existing conversation handle. Otherwise, the initiator app should begin a dialog conversation between the initiator service and the target service and send the message on that conversation handle.
- The message is placed in the sys.transmission_queue catalog view and Service Broker begins making attempts to deliver the message to the target service.
- Service Broker looks for an appropriate route and remote service binding and uses them to determine the address to connect to in order to deliver the message.
- Service Broker opens a connection to the target, authenticates, and delivers the message to the target Service Broker.
- The target Service Broker attempts to classify the message and determine what local service will handle the message (it uses route data in the msdb database for this).
- The target Service Broker delivers the message to the target service’s queue.
- Once the message is successfully delivered to the target queue, the target Service Broker looks for route information back to the initiator and attempts to deliver an acknowledgement that the message was received.
- The initiator’s Service Broker receives the acknowledgement and uses routing information in msdb to determine what local service the acknowledgement is for.
- Upon successful routing of the acknowledgement to the initiating service, the message is then removed from the sys.transmission_queue catalog view.
- If the initiator does not receive an acknowledgement that the message was received, it will periodically retry delivering the message to the target. If the target has already received the message, it will simply drop any additional delivery retries and send acknowledgements for them.
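From the initiator app's point of view, the first few steps above could look something like the following. This is a hedged sketch rather than the project's real code: the contract name [//SBRepl/Contract], the message type name [//SBRepl/Message], and the XML payload are all hypothetical, and the dialog uses default encryption settings.

```sql
-- Sketch only: run in the initiator database on SBRepl1.
-- [//SBRepl/Contract] and [//SBRepl/Message] are hypothetical object names.
DECLARE @handle  UNIQUEIDENTIFIER;
DECLARE @payload XML = N'<change table="Example" action="insert" />';

-- Reuse an open conversation to the target service if one exists.
SELECT TOP (1) @handle = conversation_handle
FROM sys.conversation_endpoints
WHERE far_service  = N'//SBReplDistrib/SBRepl'
  AND is_initiator = 1
  AND state        = 'CO';   -- CO = CONVERSING

-- Otherwise begin a new dialog between the two services.
IF @handle IS NULL
BEGIN
    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [//SBRepl1/SBRepl]
        TO SERVICE N'//SBReplDistrib/SBRepl'
        ON CONTRACT [//SBRepl/Contract];
END;

-- The message lands in sys.transmission_queue until delivery is acknowledged.
SEND ON CONVERSATION @handle
    MESSAGE TYPE [//SBRepl/Message] (@payload);
```

Everything after the SEND (routing, delivery, acknowledgement, and retries) is handled by Service Broker itself rather than by application code.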
The Odd Behavior:
Step 11 is where I am seeing some very odd behavior with Service Broker and AlwaysOn. I see the message getting delivered to the target and processed successfully, and I also see the acknowledgement getting sent back to the initiator and received. However, the message remains in sys.transmission_queue as though no acknowledgement was received. To make things even stranger, Service Broker isn’t attempting to resend the message like I would expect it to if the acknowledgement wasn’t received. Instead, the message simply remains in sys.transmission_queue, and as new messages are sent, they get delivered and acknowledged, and they too remain in sys.transmission_queue. It seems to me that Service Broker is getting the acknowledgements and therefore stops trying to deliver the message, but doesn’t remove it from sys.transmission_queue for some reason. The transmission_status for these messages remains blank, which should indicate that Service Broker hasn’t attempted to deliver them yet.
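For anyone trying to reproduce this, a query along the following lines shows the lingering messages next to the state of their conversations. It is a minimal sketch that uses only the built-in catalog views; run it in the initiator database.

```sql
-- Sketch only: list messages still waiting in the transmission queue,
-- together with the state of the conversation each one belongs to.
SELECT tq.conversation_handle,
       tq.to_service_name,
       tq.message_type_name,
       tq.enqueue_time,
       tq.message_sequence_number,
       tq.transmission_status,   -- blank in the scenario described above
       ce.state_desc,
       ce.far_service
FROM sys.transmission_queue AS tq
LEFT JOIN sys.conversation_endpoints AS ce
       ON ce.conversation_handle = tq.conversation_handle
ORDER BY tq.enqueue_time;
```

A row that stays here with an empty transmission_status after the target has already processed the message matches the behavior described above.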
I checked the retention setting on the service queue, and it is set to off, but that should only impact the service queue and not the sys.transmission_queue. I have also traced both sides of the conversation using SQL Profiler, and I am able to see the message getting sent and the acknowledgement being sent back to the initiator and getting received (see XML trace data at the end of this post).
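The retention setting can be confirmed with a quick look at sys.service_queues. This is a small sketch; it lists every user-created queue in the current database rather than assuming a particular queue name.

```sql
-- Sketch only: show retention and status flags for user-created queues.
SELECT name,
       is_retention_enabled,   -- 0 = retention is off
       is_receive_enabled,
       is_enqueue_enabled,
       is_activation_enabled
FROM sys.service_queues
WHERE is_ms_shipped = 0;
```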
One odd thing did jump out at me in the traces though. I noticed that both sides seemed to be a bit confused about the TCP connections, because messages are sent from the IP address of the node itself while the service routes and the messages themselves point to the name/IP of the AG listener. This confusion appears to be causing each side to close the existing connection between the two services and create a new one in order to deliver a message or acknowledgement. I’m not sure whether this is normal, or whether it has anything to do with why the acknowledgements aren’t being handled correctly, but it was the only thing I could see that could possibly explain the odd behavior.
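Outside of Profiler, one way to watch the connection churn is to poll the Service Broker connection DMV while messages are flowing. The sketch below assumes VIEW SERVER STATE permission and only reads built-in views; it does not by itself show which address each side dialed, so it complements the traces rather than replacing them.

```sql
-- Sketch only: run on either side while messages are being sent.
-- Connections that keep reappearing with fresh connect_time values
-- suggest they are being closed and reopened.
SELECT connection_id,
       state_desc,
       connect_time,
       login_time,
       last_activity_time,
       is_accept,            -- 1 = inbound connection, 0 = outbound
       principal_name,
       total_sends,
       total_receives
FROM sys.dm_broker_connections
ORDER BY connect_time;

-- The local Service Broker endpoint and the port it listens on.
SELECT name, port, state_desc
FROM sys.tcp_endpoints
WHERE type_desc = 'SERVICE_BROKER';
```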
The Plea for Help:
At this time, I don’t have a solution to this message retention issue other than to manually end the conversation with cleanup on both sides, and that’s not really something I want to do. If you have any ideas as to why this might be happening or what I can do about it, please leave me a comment and let me know. If there is any additional information that you would like me to provide about my setup or about the issue, please let me know in the comments as well. I will post a follow-up to this post if/when I find a solution to this issue.
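For completeness, the manual workaround mentioned above might look roughly like this on the SBRepl1 side. It is a sketch only, and a blunt one: WITH CLEANUP drops the conversation and any undelivered messages without notifying the other side, so it is not something to run casually outside a lab.

```sql
-- Sketch only: forcibly removes conversations and any messages still queued for them.
DECLARE @handle UNIQUEIDENTIFIER;

DECLARE stuck CURSOR LOCAL FAST_FORWARD FOR
    SELECT conversation_handle
    FROM sys.conversation_endpoints
    WHERE far_service = N'//SBReplDistrib/SBRepl';

OPEN stuck;
FETCH NEXT FROM stuck INTO @handle;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Removes the endpoint plus its rows in sys.transmission_queue and the service queue.
    END CONVERSATION @handle WITH CLEANUP;
    FETCH NEXT FROM stuck INTO @handle;
END;

CLOSE stuck;
DEALLOCATE stuck;
```

The same loop would need to run on SBReplDistrib with far_service set to N'//SBRepl1/SBRepl' to clear the other half of each conversation.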
The Trace Data:
I’ve removed the trace data from this blog post for brevity and security reasons. Please leave me a comment if you would like to review the trace files.