SQL 2017 Distributor Changes


There’s not much new stuff in SQL 2017 for the replication administrator. One thing I did notice, though, is that the distributor properties have changed. You can see the 2 new options below:

Using sp_helpdistributor (or select * FROM msdb.dbo.MSdistributiondbs) you can more clearly see that there are 2 options to control deletions.

Remember that we retain the commands and transactions in the distribution database, so we’d expect the deletions to relate to these records. There’s no mention of these parameters in Books Online, so I did a little digging to confirm where they are used.

The distribution cleanup job executes dbo.sp_MSdistribution_cleanup.

This proc, through nested procs, calls:

(1) sp_MSdelete_publisherdb_trans

DELETE TOP(@deletebatchsize_commands) MSrepl_commands WITH (PAGLOCK) from MSrepl_commands

(2) sp_MSdelete_dodelete

delete TOP(@deletebatchsize_transactions) MSrepl_transactions WITH (PAGLOCK) from MSrepl_transactions

So – we previously only had a cutoff point to use when deleting old transactions. When there are millions of commands in these tables, that is time-consuming and prone to blocking issues. Now we have a neat way of doing batched deletes. I’ve recently been working with some tables of a billion rows, where batching and decent indexing are essential, so I can see exactly why this has been introduced.
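
To make this concrete, here’s a minimal sketch of the general batched-delete pattern (illustrative only, with a made-up table and cutoff – it is not the actual code of the cleanup procs):

DECLARE @batchsize int = 5000,                       -- cf. @deletebatchsize_commands above
        @cutoff datetime = DATEADD(HOUR, -72, GETDATE()),
        @rows int = 1;

WHILE @rows > 0
BEGIN
    -- delete old rows in small chunks so locks are held briefly and the log stays manageable
    DELETE TOP (@batchsize)
    FROM dbo.SomeBigHistoryTable                     -- hypothetical table
    WHERE entry_time < @cutoff;

    SET @rows = @@ROWCOUNT;
END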

Dropping Replicated Tables

I recently upgraded some of our SQL 2014 servers to SP2. When going through the list of hotfixes I came across this gem: “This article describes an improvement that allows a table that’s included as an article in transactional replication publication to be dropped from the database and the publication(s).”

You might have tried at some time to drop a replicated table. You’ll see this type of message…

Remember that in SQL 2000 we had a similar restriction when altering columns on replicated tables, unless we used sp_repladdcolumn and sp_repldropcolumn. Then along came the @replicate_ddl argument for sp_addpublication, which made our lives easier and let us make schema changes seamlessly. Well, there is now a similar possibility for dropping replicated tables:

exec sp_changepublication @publication = '<Publication Name>', @property = 'allow_drop', @value = 'true'

There are some caveats to be aware of:

  • The publisher needs to be running SQL Server 2016 SP1 or 2014 SP2.
  • If the same article exists in more than one transactional publication, you have to enable this feature on all of them or you’ll still get the error that the table is in replication and can’t be dropped (a query to find those publications is sketched below).
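
To find every publication that contains a particular article, something like the following sketch (run in the publication database; the table name is just an example) gives you the list to enable allow_drop on:

SELECT p.name AS publication_name
FROM dbo.syspublications p
INNER JOIN dbo.sysarticles a ON a.pubid = p.pubid
WHERE a.objid = OBJECT_ID('dbo.City')
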
So, once you have set it up you can run the following type of statement with no errors:

drop table [dbo].[City]

Even if it is the last article in the publication and you have 100 subscribers, this is still no problem.

To see if it has been enabled you can run:

exec sp_helppublication 'mainpublication'

There is a new column returned, “allow_drop”, which will tell you it’s there.
Unfortunately, if you need to check this programmatically there isn’t a nice column in syspublications which shows the value. It is hidden in the options bitmap. I suppose this was the easiest place to put it – hopefully it’ll be explicit in SQL 2017. The value to look for is 0x80 (Books Online has not yet been updated with this value, or with the “allow_drop” column in sp_helppublication, as of 13th September 2017).
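
So if you do need a programmatic check, a sketch like this (run in the publication database) decodes the bit for you – the 0x80 value being the one mentioned above:

SELECT name,
       options,
       CASE WHEN options & 0x80 = 0x80 THEN 1 ELSE 0 END AS allow_drop
FROM dbo.syspublications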

Dropping Transactional Articles

In MSDN in the article “Add Articles to and Drop Articles from Existing Publications” there is a section which says “For snapshot or transactional publications, articles can be dropped with no special considerations prior to subscriptions being created. If an article is dropped after one or more subscriptions is created, the subscriptions must be dropped, recreated, and synchronized.”. This is pretty scary and sounds almost like a reinitialization is needed!

Let’s see what happens when we use SSMS. If you drop an article from the publication you’ll see the following dialogue:

“This will not impact existing subscriptions” sounds reassuring, but is slightly ambiguous – perhaps it is only referring to the invalidation of the snapshot. However there is clearly no mention that “the subscriptions must be dropped, recreated, and synchronized”. If you click on “Yes” you’ll find that the next time you start the distribution agent all runs fine. There will be some orphaned objects left on the subscriber – stored procedures related to the article and the like – but nothing to worry about. So perhaps the MSDN article is incorrect?

Not really! The SSMS interface is clever enough to do the hard work for you. If you try to drop the article yourself using this type of script:

EXEC sp_droparticle
@publication = 'pubtestarticle',
@article = 'tcity3',
@force_invalidate_snapshot = 1;
GO

You’ll receive an error:

Msg 14046, Level 16, State 1, Procedure sp_MSrepl_droparticle, Line 286 [Batch Start Line 0]
Could not drop article. A subscription exists on it.

What you need to do is manually drop the subscriptions on it, and then you can drop the article.

EXEC sp_dropsubscription 'pubtestarticle', 'tcity3', 'desktop-nsonkmo', 'transub'

This is what SSMS does behind the scenes, and this is all the MSDN article is really telling us – there is no need to reinitialize, just drop the relevant subscriptions, then the article, and then synchronize.
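
Putting it together, this is roughly the sequence SSMS runs for you (a sketch using the same example names as above, run at the publisher in the publication database):

-- drop the subscription(s) on the article first
EXEC sp_dropsubscription
    @publication = 'pubtestarticle',
    @article = 'tcity3',
    @subscriber = 'desktop-nsonkmo',
    @destination_db = 'transub';

-- now the article itself can be dropped
EXEC sp_droparticle
    @publication = 'pubtestarticle',
    @article = 'tcity3',
    @force_invalidate_snapshot = 1;
GO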

Dropping replicated tables… impossible?

Did you ever try to drop a replicated table? Of course this is disallowed! It was always that way, right?

If you do try it what you’ll see is a variation on the following message:

Well actually this has now changed and you can merrily drop the table.

It’s not enabled by default but if you run the following you’ll be permitted to do it.

EXEC sp_changepublication
@publication = 'yourpublicationname',
@property = N'allow_drop',
@value = 'true'
GO

If, when you run the above, you get a nasty error message like the following:

Msg 21183, Level 16, State 1, Procedure sp_MSrepl_changepublication, Line 274 [Batch Start Line 0]
Invalid property name 'allow_drop'.

…that means you have other problems: your SQL Server is not up to date! Naughty DBA! This facility came in with SQL 2014 Service Pack 2 and SQL 2016 Service Pack 1. Sorry, but nothing doing for SQL 2012/2008R2/2008!

Once it is set, you can happily drop tables in the publication database and the replication agents won’t error. The log reader will clean up the metadata related to the article and ignore any remaining unprocessed commands related to the table. It also removes the article from the publication. The distribution agent processes the article records already sent to it by the log reader, and when it reaches the cleanup command it has completed the processing for this article.

Note that the distribution agent doesn’t report the fact that the drop table has been processed during a synchronization, so we won’t really be aware this has happened. Also, the distribution agent doesn’t send the “DROP” statement down to the Subscriber, so the table is left there intact. This is quite unlike other DDL changes, which keep the publisher and subscriber in sync. It’s safer this way, but it’s a pity an additional option to propagate the drop doesn’t exist.
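
If you do want the table gone at the subscriber as well, that’s a manual step – something along these lines, run at the subscriber in the subscription database (the table name is just an example):

IF OBJECT_ID('dbo.City', 'U') IS NOT NULL
    DROP TABLE dbo.City;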

Finally, once it is set, you’ll not see it so easily in syspublications. Another explicit column for this setting would have been nice, but instead it is hidden in the logic of the options bitmap. However, you can run the following to check if it is enabled:

EXEC sp_helppublication @publication = 'yourpublicationname'

And then you’ll have a clear report of the flag setting….


Replicating to Memory-Optimised Tables

In SQL 2016 you might have wondered about the strange new setting when adding a new subscription:

This is the option for “Memory Optimised” tables and applies to Transactional and Snapshot subscriptions only. Memory Optimised tables are a newish (SQL 2014 and 2016) design feature which can be really useful for reporting purposes, and this fits in nicely with the way many Transactional Replication systems are used, so it’s definitely something worth taking a look at. As an additional bonus, in SQL 2016 it’s also available in any edition.

Just a quick clarification that we are talking about replication subscribers here, not publishers. We can’t replicate memory optimised tables from the publisher – we can only create them at the subscriber.

First of all, I’ll mention that selecting the option above when adding a subscription does nothing on its own! You’ll just get a normal table on the subscriber; we need to do a little more to get this to work.

You need to take a look at each table in the publication and set the article properties there. The 2 options you’ll need are shown below.

If you set “Enable Memory Optimisation” to true and leave the second option as false (the default), this will most likely fail. This is because primary keys are usually created with a clustered index, and Memory Optimised tables don’t play happily with clustered indexes. If your primary keys have a nonclustered index and there are no other clustered indexes on the table you’ll be OK, but this is a pretty unlikely scenario. Anyway, the wizard will check for you, and you’ll get the following error if you’ve omitted the second option when it was required.
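
If you want to check up front whether your published table will hit this, a quick sketch against the publisher database (the table name is an example) shows whether the primary key is backed by a clustered index:

SELECT i.name AS index_name,
       i.type_desc AS index_type    -- CLUSTERED here means you need the second option
FROM sys.indexes i
WHERE i.object_id = OBJECT_ID('dbo.City')
  AND i.is_primary_key = 1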

Really there’s no reason I can think of to leave the second option set to false.

So, having set the article properties and the subscription properties we are now almost ready to roll.

One final check – is the subscriber database set up to accommodate memory optimised tables? You can take a look in SSMS or run the following query, replacing “Northwind” with your database name:

select * from [Northwind].sys.filegroups where type = 'FX'

If nothing is returned, adapt and run the following replacing “Northwind” with your database name:

ALTER DATABASE northwind ADD FILEGROUP northwind_mod CONTAINS MEMORY_OPTIMIZED_DATA
ALTER DATABASE northwind ADD FILE (name='northwind_mod2', filename='c:\data\northwind_mod2') TO FILEGROUP northwind_mod
ALTER DATABASE northwind SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON

At this stage we are good to go. Create the subscription and after initialization run the following at the subscriber to check for any memory optimised tables.

SELECT OBJECT_NAME(object_id) as TableName, * FROM sys.dm_db_xtp_table_memory_stats;  

All being well you’ll get a row returned like the following and we’re done :).


Distribution agents and another reason to not use continuous schedules

Well, strictly speaking this is a post to remind you not to accept the defaults when using continuous schedules, although the problem I’ll describe never arises when you use a repeating schedule.

So – first of all are you running the distribution agent continuously? Remember back to when you set up the subscription there was an option mentioning this…

Can’t remember? Just take a look at the job schedule of the Distribution Agent. You’ll see the description will tell you “Start automatically when SQL Server Agent starts”.

This means a continuous schedule. It is the default and therefore is often the one selected. It also provides the lowest latency, so it seems like the obvious choice. However, what if you hit a data consistency error because someone has messed up the subscriber data? It happens! If a replication command cannot be applied at the subscriber, the job runs, retries, retries, retries and so on. No error is reported; things just seem to hang. Perhaps it is the report user who tells us that the data seems stale, but being professionals we should know this before they do!

The setting we need to be aware of is the one on the job step which runs the distribution agent. It is set to retry indefinitely by default – well, actually 2,147,483,647 times, but this amounts to indefinitely. If we don’t spot that there is a problem, pretty soon the subscription will expire and have to be reinitialized.

How does this look with a repeating schedule? Is it any different? Here it is:

Far more reasonable: after 10 minutes we have our error, receive a PagerDuty alert, and then connect up and sort out the problem. Why is the number of retries different here compared to the one set for a continuous schedule? Absolutely no reason I know of, and on every system I’ve worked on I reset them as soon as I realise people are using continuous schedules.

So – either reset to something reasonable (10 seems ok) or use a repeating schedule.
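
If you want to hunt down the offending job steps, a sketch like the following should find distribution agent steps still set to retry (near enough) forever; the category and subsystem names are what I’d expect, so verify them in your own environment. sp_update_jobstep then lets you dial the retries down:

SELECT j.name AS job_name, js.step_id, js.retry_attempts, js.retry_interval
FROM msdb.dbo.sysjobs j
INNER JOIN msdb.dbo.sysjobsteps js ON js.job_id = j.job_id
INNER JOIN msdb.dbo.syscategories c ON c.category_id = j.category_id
WHERE c.name = 'REPL-Distribution'
  AND js.subsystem = 'Distribution'
  AND js.retry_attempts > 100;

-- then, for each job found (the step running the agent is usually step 2):
-- EXEC msdb.dbo.sp_update_jobstep
--     @job_name = '<distribution agent job name>',
--     @step_id = 2,
--     @retry_attempts = 10,
--     @retry_interval = 1;   -- minutes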

Solving data consistency errors and the need for monitoring…

Let’s consider what happens when transactional replication fails. Say there’s been a dodgy release and the application is now pointing at a replication subscriber where data changes have happened. Realising the problem, the application is re-pointed at the publisher database and when we start the distribution agent our problems begin to surface. We’ll see some info about retrying commands and after 10 retries the distribution agent will fail. I’m assuming you have the distribution agent on a timed schedule and there’ll be another post to explain why this is a good idea :). Failed inserts and updates are the usual issues we’ll see. This has happened to me in a production environment and there are a few ways of dealing with it:

(1) Reinitialize. The safest option, but not always possible in production until we can negotiate some downtime. Sometimes we just need to get things working.

(2) Use sp_setsubscriptionxactseqno. We just need to get hold of the transaction number (xact_seqno) of the most recent failing command and tell the distribution agent to skip it and move to the next command. We could use the -Output parameter in the distribution agent job (see here). This used to be a common way of getting xact_seqno numbers, but the sp_helpsubscriptionerrors stored proc is much easier:

sp_helpsubscriptionerrors @publisher = 'DESKTOP-NSONKMO',
@publisher_db = 'testtran',
@publication = 'ptesttran',
@subscriber = 'DESKTOP-NSONKMO\PAULSINST',
@subscriber_db = 'testtran'

The proc returns enough info to help us understand what has happened, e.g. “Violation of PRIMARY KEY constraint 'PK_tRegion'. Cannot insert duplicate key in object 'dbo.tRegion'. The duplicate key value is (3).”. The main thing we need, though, is the xact_seqno, which is also returned.

Then we run the following stored procedure at the Subscriber on the subscription database:

sp_setsubscriptionxactseqno @publisher = 'DESKTOP-NSONKMO'
, @publisher_db = 'testtran'
, @publication = 'ptesttran'
, @xact_seqno = 0x00000022000005FF000300000000

Note that this can be a pain if there are a lot of errors: each time you skip the last failing command, you then have to find the next xact_seqno and skip that one too, so you could be doing this for a while.

(3) We can set the distribution agent profile to skip all these types of data consistency errors – missing rows and already-existing rows.

In fact this is what the previous admin had set as a default profile on the system I was looking at. Most often it is the fastest way to get things up and running and it avoids the iterative approach of sp_setsubscriptionxactseqno.

The issue I have is that once this is set, no error is recorded by the distribution agent while it happily skips commands which don’t work.

We should of course run regular validations, but in the meantime, if you are using this profile, what you need is a separate job which alerts you when commands have been skipped, so that you can take action. Just create a regular job which checks for consistency errors and sends an email if any are found. The syntax for the check is below:

SELECT id, time, error_text, error_code
FROM MSrepl_errors msre
WHERE msre.id IN (SELECT error_id
                  FROM MSdistribution_history)
  AND error_code IN (2601, 2627, 20598)
ORDER BY msre.time DESC
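
As a sketch of the alerting step itself (assuming Database Mail is configured; the profile name, recipient, and the distribution database being called “distribution” are all placeholders/assumptions), run the following in the distribution database on a schedule:

IF EXISTS (SELECT 1
           FROM MSrepl_errors msre
           WHERE msre.id IN (SELECT error_id FROM MSdistribution_history)
             AND msre.error_code IN (2601, 2627, 20598)
             AND msre.time > DATEADD(HOUR, -1, GETDATE()))   -- only alert on recent skips
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DBA Mail',                          -- placeholder
        @recipients = 'dba-team@example.com',                -- placeholder
        @subject = 'Replication: commands skipped by the distribution agent',
        @query = 'SELECT id, time, error_code, error_text
                  FROM distribution.dbo.MSrepl_errors
                  WHERE error_code IN (2601, 2627, 20598)
                  ORDER BY time DESC;';
END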


Peer-to-Peer – the “hidden” column

Sometimes it’s useful to know how peer-to-peer really works under the bonnet. This is useful for troubleshooting, or just out of interest if you’re feeling geeky :). To see how rows are tracked, we’ll need to make sure the dedicated admin connection is working:

exec sp_configure 'show advanced options', 1
reconfigure
exec sp_configure 'remote admin connections', 1
reconfigure

Connect to your SQL instance using the DAC, e.g. Admin:Servername

Now you’ll have access to a special hidden column: $sys_p2p_cd_id. In the case of my table tRegion I run:

select *, $sys_p2p_cd_id as HiddenColumn from [TestP2Pdb]..tRegion

Lo and behold I can see some funky numbers in this column….

To find out what it means, and how this data is used, we run the following query:

select poh.originator_node, res.*
from
(
    select *,
           $sys_p2p_cd_id as HiddenColumn,
           sys.fn_replvarbintoint($sys_p2p_cd_id) as OriginatorID,
           sys.fn_replp2pversiontotranid($sys_p2p_cd_id) as TranID
    from tRegion
) res
inner join dbo.MSpeer_originatorid_history poh
    on res.OriginatorID = poh.originator_id

The query above reflects the fact that the hidden column $sys_p2p_cd_id actually holds 2 separate bits of info: the node which made the change and the id of the transaction itself.

The internal function sys.fn_replvarbintoint is particularly useful here, as it returns the Originator_ID. In the query above I have joined the inner results to the MSpeer_originatorid_history table in order to get the actual peer hostname (originator_node), so you can see what is going on, but the numerical Originator_ID is important in itself: it is used in conflict resolution when p2p_continue_onconflict is set to true – the Originator_ID with the highest value wins in a conflict.
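
For completeness, continuing on conflict is a publication property. I believe it is set like this (treat the property name as something to verify against the sp_changepublication documentation for your version; the publication name is a placeholder):

EXEC sp_changepublication
    @publication = 'yourP2Ppublication',       -- placeholder
    @property = N'p2p_continue_onconflict',
    @value = 'true'
GO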

Looking a bit deeper…

At this stage you might ask: how can this possibly work? If the TranID changes when a row changes, the value in $sys_p2p_cd_id must change, so the value of $sys_p2p_cd_id on the node that made the change and on another node (not yet changed) will naturally be different. If we simply looked for differences in $sys_p2p_cd_id as an indication of a conflict, then every data change would produce one!

Actually, the log reader here does something quite clever. When the data change is first made, the old and the new values of the $sys_p2p_cd_id column are stored in the MSRepl_Commands table. We can use sp_browsereplcmds to see that these values are added to the arguments for the internal stored procedure sp_MSupd_dbotRegion. This way the old value of $sys_p2p_cd_id can be checked against the value at the destination, because this should still be identical. If it isn’t then there is a conflict and we then go on to take a look at the 2 Originator_IDs and check for the highest value which will predominate.