AlwaysOn Availability Groups, Isolation Levels, & Selects Blocking the Redo Thread

This past weekend, I had the pleasure of presenting my session SQL Server 2012 AlwaysOn Readable Secondaries at SQL Saturday #126 in Indianapolis, IN. In that session, I covered the basics of AlwaysOn Availability Groups, how to set them up, failing over, and using readable secondary replicas for read-only workloads and backups. During the session, I always make sure to mention the possibility of queries running on the secondary replica blocking the redo thread that is trying to write changes from the primary database to the secondary database.

After mentioning this caveat, I was asked a very good question: Can you use transaction isolation levels (such as snapshot isolation) with AlwaysOn Availability Groups, and would they help to avoid the issue of read-only queries on secondary replicas blocking the redo thread? In order to answer this question, I’m going to break it two parts, and then we’ll work through a couple demos to illustrate the answers.

Questions:

Q: Can you use transaction isolation levels with AlwaysOn Availability Groups?
A:
Yes and no. Yes, you can set whatever isolation level you would like when running queries on the secondary replicas, and SQL Server will not return an error. However, SQL Server will automatically override the isolation level (and ignore all lock hints) when querying a read-only replica, and instead force the queries to use snapshot isolation.

Per the “Benefits” topic of the Active Secondaries: Readable Secondary Replicas (AlwaysOn Availability Groups) BOL article:

Read-only workloads use row versioning to remove blocking contention on the secondary databases. All queries that run against the secondary databases are automatically mapped to snapshot isolation transaction level, even when other transaction isolation levels are explicitly set. Also, all locking hints are ignored. This eliminates reader/writer contention.

Q: Does using snapshot isolation prevent read-only queries from blocking the redo thread on secondary replicas?
A: After reading the above quote from BOL, you would think that read-only queries won’t block the redo thread, and that is true for DML statements. However, it is not true for DDL statements. Sure, snapshot isolation prevents read-only queries on the secondary replicas from taking locks and preventing other DML statements against the database from executing, but it doesn’t prevent the read-only queries from taking schema stability locks and blocking DDL statements.

If you take a look later on in the same article I quoted above, you will come to the “Read-Only Workload Impact” section under the “Performance Considerations” topic:

Also, read-only workloads on the secondary replicas can block data definition language (DDL) changes that are applied through log records. Even though the read operations do not take shared locks because of row versioning, these operations take schema stability (Sch-S) locks, which can block redo operations that are applying DDL changes.

Demos:

In order to have a reference architecture to use for this demo, I’m going to use the Availability Group that I setup as part of the presentation that I mentioned at the beginning of this article. If you would like to play along at home, you can download the scripts and slide deck for that session here. This Availability Group is called DemoAG, and has three replicas, AlwaysOn1 (primary), AlwaysOn2 (secondary), and AlwaysOn3 (secondary). There are two databases participating in the Availability Group: AdventureWorks2012 and AdventureWorksDW2012. I’m going to be using AdventureWorks2012 for this demo.

Testing isolation levels and lock hints:

Unfortunately, I haven’t come up with a good way to demonstrate that SQL is overriding the transaction isolation level on the readable secondary because it is actually doing this internally (I’ll keep looking into this to see if there is a good way to demonstrate it). I can, however, demonstrate that SQL Server is ignoring lock hints on the read-only replicas. I can do this by beginning a transaction, executing a select query with the tablockx lock hint, and leaving the transaction open. I then open a new query window and query sys.dm_tran_locks to see if there is a table lock on the table. If I do this on the primary, you will see an open table lock on the table. If I do this on the secondary, there is no open table lock. If I then try to update data on the primary replica, I am able to see the statement completes and that the data has been written to the replica, even with the open transaction on the secondary that should be holding a table lock.

-- Execute this in a session of its own on the primary replica:
begin tran;

select * from ErrorLog with (tablockx);

-- With the above transaction open, launch a new query window on
-- the primary replica, and execute this query
-- (you should notice a table lock on the ErrorLog table):
use master;

select
t.name
, dtr.*
from sys.dm_tran_locks dtr
join sys.databases d
on d.name = 'AdventureWorks2012'
and d.database_id = dtr.resource_database_id
join AdventureWorks2012.sys.tables t
on t.object_id = dtr.resource_associated_entity_id;

-- Now try running this insert in the same query window as the sys.dm_tran_locks query
-- (you should get blocked until you commit or rollback the transaction in the other session):
insert AdventureWorks2012.dbo.ErrorLog(UserName, ErrorNumber, ErrorMessage)
values('DemoUser', 1, 'Demo Message');

-- Go ahead and close both query windows, and connect to one of the secondary replicas.
-- Now, try running through the scenario above on the secondary replica (minus the insert).
-- You should notice that there is no table lock being held, even though we explicitly requested one.

Testing read-only queries blocking DDL statements:

As far as demonstrating that read-only queries can block DDL statements, all we need to do is run a long-running query on the secondary replica and simultaneously try to alter the table that the query on the secondary is using. We should notice that the redo queue size on the secondary then increases and stays that way until we either kill or complete the read-only query. This indicates that the redo thread is not able to commit the DDL changes because the select statement is taking a schema stability lock and preventing the alteration to the table.

-- Run this query on the secondary replica
-- (it's horribly ugly and inefficient, but the idea is to hold a
-- schema stability lock on ErrorLog for as long as we can):

select
NumbersTable.Number
from AdventureWorks2012.dbo.ErrorLog el
cross apply (
select
Number
from
(
select
row_number() over (order by s1.name) as number
from AdventureWorks2012.sys.sysobjects s1
cross apply AdventureWorks2012.sys.sysobjects s2
cross apply AdventureWorks2012.sys.sysobjects s3
cross apply AdventureWorks2012.sys.sysobjects s4
cross apply AdventureWorks2012.sys.sysobjects s5
) as InnerNumbersTable
) NumbersTable
group by NumbersTable.Number
order by NumbersTable.Number desc;

-- While that beast is running on the secondary replica, run the
-- following on the primary replica:

alter table AdventureWorks2012.dbo.ErrorLog add DemoColumn varchar(10) null;

-- Now go back to the secondary replica, open a new query window, and run the
-- query below (you should notice that the redo queue is greater than 0,
-- which indicates that the redo thread is having trouble applying changes
-- to the database):
select
dhdrs.redo_queue_size
from master.sys.dm_hadr_database_replica_states dhdrs
join master.sys.databases d
on d.database_id = dhdrs.database_id
and d.name = 'AdventureWorks2012';

10 thoughts on “AlwaysOn Availability Groups, Isolation Levels, & Selects Blocking the Redo Thread

  1. Great post!
    Assuming asynchroneous always-on, if we do online reindexing, will it be stock in the redo thread until the select on the readeable replicat is complete?

      • Hi Clement,

        I just tested this scenario, and the online index rebuild does get stuck in the redo queue of the asynchronous replica until the select statement running on the replica can finish and release the Sch-S lock.

        This is no different than doing these two things with a single database on a single instance. The rebuild needs to take a schema lock to complete the operation, even though it’s an online operation. The online option just tells SQL to build another copy of the index in parallel to the original (which doesn’t take a lock) and then it takes a schema lock to swaps the newly built index for the old one. Therefore, an online index rebuild is not actually a non-locking operation in its entirety.

        — Adam

    • Hi Tikiwan,

      Read-only queries on the Secondary Replica MUST be in Snapshot Isolation. This is a fundamental part of how Microsoft designed Readable Secondary Replicas in Availability Groups. This does means that your select statements will not block any other DML statements (Inserts, Updates, Deletes). However, in Snapshot Isolation (and even with the nolock hint, although Readable Secondary Replicas ignore the nolock hint), select statements take a Sch-S (Schema Stability) lock on the underlying table, and therefore will block any DDL statement (create, alter, drop, etc.) that would modify the schema of the underlying table. Unfortunately, there is no way around this behavior.

      — Adam

      • I always see default read-commit isolation level for the sessions running on Secondary replicas. is there any specific conditions, they use snapshot isolation on secondary replicas?

        • This is required behavior that is built into the product. If you enable a readable secondary replica, you are going to have all transactions on that replica happening under this isolation level, even if you ask for another isolation level. There is no way to change or override this behavior.

  2. Hi Adam,
    This is a great post. I was pointed here from SQLServerCentral.com by Gail Shaw. On the forums there I asked this question:
    “I am looking into AlwaysOn Availabilty Groups as both an HA and load spreading solution. But there is something I would like to clarify before I present to management.

    Our production databases suffer from lots of blocking and deadlocks. Will this resource contention be “replicated” on the replica databases when the logs are applied? I’d like to offload many of the read operations (adhoc reports, etc.) to the replicas but don’t just want to end up in the same situation as I am in now.”

    The SQL dev team in my company and I are working to improve performance on the production system, but it will be a long and arduous process. So I was trying to kill two birds with one stone – HADR and performance improvements.

    Hardware and licenses aren’t really an issue as we will be using existing SQL instances that are currently our logshipping secondaries. And, as mentioned, we are addressing the performance issues at source. This is legacy code, rife with nested cursors and repetitive/overlapping queries, that is being slowly reworked. Index tuning and maintenance is one of my main tasks which I carry out on a weekly basis.

    The implementation of availability groups is being planned as part of a wider improvement to high availability and disaster recovery. And, I thought it would be advantageous to further take advantage of the ability to perform read-only tasks (backups, some reporting) from the secondary servers. it is just likely that AG will be implemented before the code rewrite. So I didn’t want to make outlandish promises if the secondary databases will run into the same blocking/deadlocking problems as the primary ones.

    Any help on this would be greatly appreciated.

    Tom

    • Hi Tom,

      I’m sorry it has taken me so long to reply. I’ve been on the road more than I’ve been at home these past couple weeks. Anyway, as for your question about blocking and deadlocking with Availability Groups, it’s hard to say whether or not this will be more or less of an issue for you once you implement AGs, as it is heavily dependent upon your setup and your workload.

      My major concern is if you are using synchronous commit mode for any of your replicas. If you are, then blocking issues on your secondaries can and will impact the performance and execution of queries on the primary. If you are using asynchronous commit mode, then the primary will be able to hum along happily, even if blocking and deadlocking is occurring on the secondary replicas.

      As for the cause of that blocking and deadlocking, it is primarily caused when an alter statement or index maintenance is replicated to the secondaries which requires a schema lock to be taken. That schema lock will prevent any of the select statements that your read-only queries might be trying to do, and therefore the queries would wait for the lock to be released before they would execute (this isn’t too terribly bad). You run into real trouble when this happens in reverse, and the select statements that you’re running on the secondary blocks the redo thread from apply changes coming from the primary replica. In asynchronous commit mode, this would just cause the redo queue to grow on the secondary until the locks clear and the redo thread is able to apply the changes (this wouldn’t impact the primary replica). In synchronous commit mode, the select queries would block the ability to take the locks, and because the primary replica waits for changes to be replicated and applied successfully on all synchronous commit replicas before it completes execution locally, this can cause major timeout, blocking, and deadlocking issues on the primary replica.

      I hope this has helped clarify some of the reasons for and causes of blocking and deadlocking in an AlwaysOn AG topology. Please let me know if you have any additional questions about this.

      — Adam

Leave a Reply