Dropping replicated tables… impossible?

Did you ever try to drop a replicated table? Of course this is disallowed! It was always that way, right?

If you do try it, what you'll see is a variation on the following message:

Well, actually this has now changed and you can merrily drop the table.

It's not enabled by default, but if you run the following you'll be permitted to do it:

EXEC sp_changepublication
@publication = 'yourpublicationname',
@property = N'allow_drop',
@value = 'true'
GO

If, when you run the above, you get a nasty error message as follows:

Msg 21183, Level 16, State 1, Procedure sp_MSrepl_changepublication, Line 274 [Batch Start Line 0]
Invalid property name 'allow_drop'.

…that means that you have other problems: your SQL Server is not up to date! Naughty DBA! This facility came in with SQL 2014 Service Pack 2 and SQL 2016 Service Pack 1. Sorry, but nothing doing for SQL 2012/2008R2/2008!

Once it is set, you can happily drop tables in the publication database and the replication agents won't error. The log reader will clean up the metadata related to the article, ignore any remaining unprocessed commands for the table, and remove the article from the publication. The distribution agent processes the article records already sent to it by the log reader, and once it reaches the cleanup command it has completed its processing for this article.

Note that the distribution agent doesn't report the fact that the drop table has been processed during a synchronization, so we will not really be aware this has happened. The distribution agent doesn't send the DROP statement down to the Subscriber either, so the table is left there intact. This is quite unlike other DDL changes, which keep the publisher and subscriber in sync. It's safer this way, but it's a pity an additional option to send the drop doesn't exist.

Finally, once it is set you'll not see it so easily in syspublications. Another explicit column for this setting would have been nice, but instead it is hidden in the logic of the options bitmap. However, you can run the following to check if it is enabled:

EXEC sp_helppublication @publication = 'yourpublicationname'

And then you'll have a clear report of the flag setting…


Replication Agents: Profiles and Parameters

OK – a few pointers to agent profiles, job parameters and such…

We usually find that our replication agents function quite nicely out of the box but at some stage we’ll need to tweak them. Perhaps they need to continue when an error is found, log more troubleshooting details to a file, have a longer query timeout and so on. This can be done by using one of the inbuilt profiles or by creating a custom one and setting up the values there. Either way we then assign the profile to the agent.
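If you prefer T-SQL to the SSMS dialogs, a custom profile can also be created at the Distributor with sp_add_agent_profile and sp_add_agent_parameter. The sketch below is just an illustration – the profile name, description and timeout value are made up – and the profile still has to be assigned to the agent afterwards (via Replication Monitor or the agent properties):

-- Run at the Distributor. Creates a custom Distribution Agent profile
-- (agent_type 3 = Distribution Agent) and adds one parameter to it.
DECLARE @profile_id int

EXEC sp_add_agent_profile
@profile_id = @profile_id OUTPUT,
@profile_name = N'Distribution - long query timeout',   -- made-up name
@agent_type = 3,
@description = N'Custom profile with a longer QueryTimeout',
@default = 0

EXEC sp_add_agent_parameter
@profile_id = @profile_id,
@parameter_name = N'QueryTimeout',
@parameter_value = N'3600'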

When we go down this route and compare the agent profile parameters to the list of agent executable parameters here, it seems there is a big difference. Many are missing from our profile! The first thing to do is untick the checkbox at the bottom of the profile (below) and hey presto, many more appear. Still not the full list of agent parameters, but 90% are there now. I'll mention the missing 10% in a minute.

So, we create our profile, untick the checkbox, set values and set the agent to use it. Now maybe we have hundreds of agents and many different profiles. It is quite possible that local subscriptions are optimised to use a different profile from remote subscriptions with flaky connections. Some agents should continue on data consistency errors while others should fail and alert us. And so on. This one-to-many relationship between profiles and agents is not represented in the Replication Monitor – we have to go into each agent to see which profile it is using. You can use the query below to see what is what:

SELECT
[publication] as Publication
,c.srvname as SubscriberName
,b.profile_name as Profile
,[name] as DistributionJobName
FROM [distribution].[dbo].[MSdistribution_agents] a
inner join msdb.[dbo].[MSagent_profiles] b
on a.profile_id = b.profile_id
inner join master..sysservers c
on a.subscriber_id = c.srvid
order by b.profile_name

This returns a recordset which lists the agents and their profiles:

Remember that above I said 90% of the parameters are available when we untick the checkbox on the profile? There are some others, such as "Output", "Hostname", "AltSnapshotFolder", "UseInprocLoader" and a few more, which have to be added to the job step directly as they don't exist in the profile.
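As an illustration (the server, database and publication names below are placeholders), these extra switches are simply appended to the command text of the "Run agent." job step, for example:

-Publisher [MYPUBLISHER] -PublisherDB [testtran] -Publication [ptesttran]
-Distributor [MYDISTRIBUTOR] -DistributorSecurityMode 1
-Subscriber [MYSUBSCRIBER] -SubscriberDB [testtran] -SubscriptionType 0
-Output C:\temp\distrib_output.log -OutputVerboseLevel 2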

Now with this level of complexity – agents, different profiles, job step parameters and such – we have a tricky situation. When an agent runs, which profile and parameters did it use, and what job values were used? We can use my query above to help, but perhaps the job parameter values have been changed since the job was run, the agent might have been reset to use a different profile since then, or the profile itself might have been changed. You could take a look in the distribution agent history, but it is not detailed enough to help us and only gives basic info. To find out what our agent really used when it ran we need to check the job history.
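If you'd rather not click through the job history viewer, a query along these lines (the job name is a placeholder) pulls back the recent history rows – although bear in mind the message column can truncate long output:

SELECT h.run_date, h.run_time, h.step_id, h.step_name, h.message
FROM msdb.dbo.sysjobhistory h
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = h.job_id
WHERE j.name = N'MYPUBLISHER-testtran-ptesttran-MYSUBSCRIBER-1'  -- placeholder distribution job name
ORDER BY h.instance_id DESC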

You can see below that the job history separates the output into 2 logical sets of parameters – "user-supplied" and "from agent profile". The profile name itself is not mentioned here (pity), but the particular values it used are, which is anyway the main level of detail we need.

One other point to note is that in the above example I've highlighted in blue the QueryTimeout parameter, which has a value of 7000 seconds. Why? Because as an experiment I had added it to the agent job step, and it was also in the profile with a different value of 3600 seconds. So we see that when both are entered, the job step parameter takes priority and overrides the profile value. Really we should use one or the other and not mix the 2 methods, and generally I'd advise using the profile as it is more transparent.

I look at this output to confirm the values being used are the ones I expected to see. There are a few reasons for this: I might have many agents using a particular profile and just want to change one agent on an ad-hoc basis (and remove the parameter afterwards!), I might want to investigate why my custom profile value is not being applied, or the parameter value might be in the 10% not available in the profile and I just want to check my input value is being used…

Replicating to Memory-Optimised Tables

In SQL 2016 you might have wondered about the strange new setting when adding a new subscription:

This is the option for "Memory Optimised" tables and applies to Transactional and Snapshot subscriptions only. Memory Optimised tables are a newish (SQL 2014 and 2016) design feature which can be really useful for reporting purposes, and this fits in nicely with the use of many Transactional Replication systems, so it's definitely something worth taking a look at. As an additional bonus, in SQL 2016 it's also available in any edition.

Just a quick clarification that we are talking about replication subscribers here, not publishers. We can’t replicate memory optimised tables from the publisher – we can only create them at the subscriber.

First of all I'll mention that selecting the option above when adding a subscription does nothing on its own! You'll just get a normal table on the subscriber, and we need to do a little more to get this to work.

You need to take a look at each table in the publication and set the article properties there. The 2 options you’ll need are shown below.

If you set "Enable Memory Optimisation" to true and leave the second option as false (which is the default), most likely this will fail. This is because primary keys are usually created with a clustered index, but Memory Optimised tables don't play happily with clustered indexes. If your primary keys have a non-clustered index and there are no other clustered indexes on the table you'll be OK, but this is a pretty unlikely scenario. Anyway, the wizard will check for you and you'll get the following error if you've omitted the second option and it was required.
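If you want to check in advance whether your tables will trip over this, a quick look at the primary key indexes at the publisher does the job – this is just a general sketch against the whole database rather than the publication metadata:

-- List tables whose primary key is backed by a clustered index
SELECT OBJECT_NAME(i.object_id) AS TableName, i.name AS PrimaryKeyName
FROM sys.indexes i
WHERE i.is_primary_key = 1
AND i.type_desc = 'CLUSTERED'
ORDER BY TableName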

Really there’s no reason I can think of to leave the second option set to false.

So, having set the article properties and the subscription properties we are now almost ready to roll.

One final check – is the subscriber database set up to accommodate memory optimised tables? You can take a look in SSMS, or run the following query (replacing "Northwind" with your database name) to see:

select * from [Northwind].sys.filegroups where type = 'FX'

If nothing is returned, adapt and run the following replacing “Northwind” with your database name:

ALTER DATABASE northwind ADD FILEGROUP northwind_mod CONTAINS MEMORY_OPTIMIZED_DATA
ALTER DATABASE northwind ADD FILE (name='northwind_mod2', filename='c:\data\northwind_mod2') TO FILEGROUP northwind_mod
ALTER DATABASE northwind SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON

At this stage we are good to go. Create the subscription and after initialization run the following at the subscriber to check for any memory optimised tables.

SELECT OBJECT_NAME(object_id) as TableName, * FROM sys.dm_db_xtp_table_memory_stats;  

All being well you’ll get a row returned like the following and we’re done :).


Credentials and Proxies and why bother using them?

When we are setting up a subscriber there is a screen where we decide on the agent security.

The ellipsis button above takes us to a detailed setup form. Most often, I think, DBAs select the highlighted option in the security form below.

Certainly the installations I’ve seen imply that this has always been chosen. This was the only possible setup used in SQL 2000 and despite the warning that “this is not a recommended security best practice” it is still definitely the easiest option.

In this scenario distrib.exe or replmerg.exe is run by the Windows account that SQL Server Agent is set to use. This account is impersonated to give access to the distributor and the subscriber. Often the SQL Agent account is a sysadmin on the publisher and a local Windows administrator. Sometimes this exact same account is also a sysadmin and local admin on the subscriber, and in some extreme cases it is a domain admin. I'm not recommending any of these – just pointing out what seems commonplace. This is the crux of the issue, as Microsoft always recommends using a least-privilege user account for services.

But why bother with this least-privilege philosophy for replication? Well, things can be run through the replication engine – nasty things. The most obvious example is sp_addscriptexec: you have to be db_owner to run it, but essentially a db_owner can become sysadmin through this means. Another example would be all those text files in the snapshot directory; another would be the DDL change scripts. What this really amounts to is the possibility of SQL injection via replication.

So – if you really want to lock it all down and ensure that the replication engine cannot do too much harm in the hands of a malicious user, we need to consider using credentials and proxies and get away from impersonating the SQL Agent account. A credential holds the details of a Windows user and password, and this account should have reduced privileges compared to the SQL Agent service account. If the credential account and the SQL Agent account have the same level of access to SQL and to the OS then you are going down the wrong street!

A proxy is a reference to the credential to be used for a particular function – "Replication Distributor" above. You might want the same credential to be used for both the merge and transactional agents – so in that case one credential and 2 proxies. Both credentials and proxies are easy to set up in SSMS (above); you'll just need to create the credential first and then the proxy.
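For reference, the same thing can be scripted. The sketch below is only an illustration – the domain account, password and credential name are placeholders, and the proxy name matches the one I use later in this post:

-- Create the credential (maps to a low-privilege Windows account) at the server level
USE master
CREATE CREDENTIAL ReplDistributorCredential
WITH IDENTITY = N'MYDOMAIN\svc_repl_distrib',  -- placeholder low-privilege account
SECRET = N'StrongPasswordHere'
GO

-- Create the proxy in msdb and allow it to run Replication Distributor job steps
USE msdb
EXEC dbo.sp_add_proxy
@proxy_name = N'DistributorProxy',
@credential_name = N'ReplDistributorCredential',
@enabled = 1

EXEC dbo.sp_grant_proxy_to_subsystem
@proxy_name = N'DistributorProxy',
@subsystem_name = N'Distribution'
GO

If the replication jobs are owned by a non-sysadmin login, that login also needs to be granted use of the proxy (sp_grant_login_to_proxy).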

Once this is done the job step that runs the particular subscriber agent will have another entry in the drop down – the name of the Proxy account that you added. In the case below I select the second job step of the distribution job to use the “DistributorProxy” proxy account (sorry about my naming convention!).

You might wonder how this all relates to the option below in the subscription wizard which has the option to state an alternative Windows account. Seems similar?

Well, this does all the Proxy/Credential/Job configuration for you in the background! The only problem is that if you leave it to the wizard you’ll end up with a mess of duplicates and horrible proxy and credential names like the one below, so it’s definitely worth designing it yourself.

Distribution agents and another reason to not use continuous schedules

Well, strictly speaking this is a post to remind you not to accept the defaults when using continuous schedules, although the problem I'll describe never arises when you use a repeating schedule.

So – first of all, are you running the distribution agent continuously? Remember back when you set up the subscription there was an option mentioning this…

Can’t remember? Just take a look at the job schedule of the Distribution Agent. You’ll see the description will tell you “Start automatically when SQL Server Agent starts”.

This means a continuous schedule. It is the default and therefore often the one selected. It also provides the lowest latency, so it seems like the obvious choice. However, what if you hit a data consistency error because someone has messed up the subscriber data? It happens! If the replication command cannot be applied to the subscriber, the job runs, retries, retries, retries and so on. No error is reported. Things just seem to hang. Perhaps it is the report user who tells us that the data seems stale, but being professionals we should know this before them!

The setting we need to be aware of is the one on the job step which runs the distribution agent. It is set to retry indefinitely by default – well, actually 2,147,483,647 times, but that amounts to indefinitely. Pretty soon the subscription will expire and have to be reinitialized if we don't spot that there is a problem.

How does this look with a repeating schedule? Is it any different? Here it is:

Far more reasonable: after 10 minutes we have our error, receive a PagerDuty alert and then connect up and sort out the problem. Why is the number of retries different here compared to that set in a continuous schedule? Absolutely no reason I know of, and on every system I've worked on I reset them as soon as I realise people are using continuous schedules.

So – either reset the retries to something reasonable (10 seems OK) or use a repeating schedule.
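If you'd rather script the change than click through the job step properties, something like this does it – the job name is a placeholder, and in distribution jobs the agent normally runs in step 2:

EXEC msdb.dbo.sp_update_jobstep
@job_name = N'MYPUBLISHER-testtran-ptesttran-MYSUBSCRIBER-1',  -- placeholder distribution job name
@step_id = 2,  -- the "Run agent." step
@retry_attempts = 10,
@retry_interval = 1  -- minutes between retries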

Solving data consistency errors and the need for monitoring…

Let's consider what happens when transactional replication fails. Say there's been a dodgy release and the application is now pointing at a replication subscriber where data changes have happened. Realising the problem, the application is re-pointed at the publisher database, and when we start the distribution agent our problems begin to surface. We'll see some info about retrying commands, and after 10 retries the distribution agent will fail (I'm assuming you have the distribution agent on a timed schedule – see the previous section for why this is a good idea :)). Failed inserts and updates are the usual issues we'll see. This has happened to me in a production environment and there are a few ways of dealing with it:

(1) Reinitialize. The safest option, but not always possible in production until we can negotiate some downtime. Sometimes we just need to get things working.

(2) Use sp_setsubscriptionxactseqno. We just need to get hold of the transaction number (xact_seqno) of the most recent failing command and tell the distribution agent to skip it and move on to the next command. We could use the -Output parameter in the distribution agent job (see here). This used to be a common way of getting xact_seqno values, but the sp_helpsubscriptionerrors stored proc is much easier:

sp_helpsubscriptionerrors @publisher = 'DESKTOP-NSONKMO',
@publisher_db = 'testtran',
@publication = 'ptesttran',
@subscriber = 'DESKTOP-NSONKMO\PAULSINST',
@subscriber_db = 'testtran'

The proc returns enough info to help us understand what has happened, e.g. "Violation of PRIMARY KEY constraint 'PK_tRegion'. Cannot insert duplicate key in object 'dbo.tRegion'. The duplicate key value is (3).". The main thing we need, though, is the xact_seqno, which is also returned.

Then we run the following stored procedure at the Subscriber on the subscription database:

sp_setsubscriptionxactseqno @publisher = 'DESKTOP-NSONKMO'
, @publisher_db = 'testtran'
, @publication = 'ptesttran'
, @xact_seqno = 0x00000022000005FF000300000000

Note that this can be a pain if there are a lot of errors. Each time you'll skip the last tried command, then find the next xact_seqno and skip that one too, so you could be doing this for a while.

(3) We can set the distribution agent profile to skip all these types of data consistency error – missing rows and rows that already exist. This is the built-in "Continue on data consistency errors" profile, which uses the -SkipErrors parameter to ignore errors 2601, 2627 and 20598.

In fact this is what the previous admin had set as the default profile on the system I was looking at. Most often it is the fastest way to get things up and running, and it avoids the iterative approach of sp_setsubscriptionxactseqno.

The issue I have is that once this is set, no error is recorded while the distribution agent happily skips commands which don't work.

We should of course run regular validations, but in the meanwhile, if you are using this profile, what you need is a separate job which alerts you when commands have been skipped, so that you can take action. Just create a regular job which checks for consistency errors and sends an email if any are found. The syntax for the check is below:

SELECT id, time, error_text, error_code
FROM MSrepl_errors msre
WHERE msre.id IN (SELECT error_id
                  FROM MSdistribution_history)
AND error_code IN ('2601', '2627', '20598')
ORDER BY msre.time DESC

MSrepl_errors.error_code is of type sysname and may contain non-integer values (for example 'HYT00' and '08S01'), so your query should specify the error codes as string literals to avoid the error:

Msg 245, Level 16, State 1, Line 1
Conversion failed when converting the nvarchar value '08S01' to data type int.
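Putting it together, the alerting job step could look something like the sketch below – the Database Mail profile name and recipient address are placeholders, and the step should run in the distribution database:

-- Minimal sketch of the alerting job step
IF EXISTS (SELECT 1
           FROM MSrepl_errors msre
           WHERE msre.id IN (SELECT error_id FROM MSdistribution_history)
           AND msre.error_code IN ('2601', '2627', '20598'))
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA Mail',  -- placeholder Database Mail profile
        @recipients = N'dba-team@example.com',  -- placeholder recipient
        @subject = N'Replication: commands skipped by the distribution agent',
        @body = N'Data consistency errors were skipped - check MSrepl_errors in the distribution database.'
END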

Scheduling Validations

In merge replication we want to check that the data is in sync. This is done by setting up a validation:

Checking the rowcounts and the checksums is the only surefire way to get a comparison:

Nothing happens until we run the merge agent and synchronize. At that point if the data is out of sync the job fails and we’ll get an error in the merge agent history.

The best thing to do is to have a separate job which regularly triggers the validation. I have one which runs each Monday morning and has the following type of text in it:

exec sp_validatemergepublication @publication = 'pubTestMerge', @level = 3  -- level 3 = row count and binary checksum validation

As a final point, if the merge agent is run again afterwards, there is a successful sync and no mention of the invalid data.

So – perhaps the support team re-run the agent and the important info is lost. This happens quite often, because the merge agent might fail due to an innocuous network issue. Just something to be aware of.

Peer-to-Peer – the “hidden” column

Sometimes it's useful to know how peer-to-peer really works under the bonnet – for troubleshooting, or just out of interest if you're feeling geeky :). To see how rows are tracked we'll need to make sure the dedicated admin connection (DAC) is working:

exec sp_configure 'show advanced options', 1
reconfigure
exec sp_configure 'remote admin connections', 1
reconfigure

Connect up to your SQL instance using the DAC, e.g. Admin:Servername

Now you’ll have access to a special hidden column: $sys_p2p_cd_id. In the case of my table tRegion I run:

select *, $sys_p2p_cd_id as HiddenColumn from [TestP2Pdb]..tRegion

Lo and behold I can see some funky numbers in this column…

To find out what it means, and how this data is used, we run the following query:

select originator_node, res.* from
(
select *, $sys_p2p_cd_id as HiddenColumn
,sys.fn_replvarbintoint($sys_p2p_cd_id) as OriginatorID
,sys.fn_replp2pversiontotranid($sys_p2p_cd_id) as TranID
from tregion
) res
inner join dbo.MSpeer_originatorid_history poh
on res.OriginatorID = poh.originator_id

The query above reflects the fact that the hidden column $sys_p2p_cd_id actually holds 2 separate bits of info: the node which made the change and the id of the transaction itself.

The internal function sys.fn_replvarbintoint is particularly useful here as this returns the Originator_ID. In the query above I have joined the inner results to the MSpeer_originatorid_history table in order to get the actual peer hostname (originator_node) so you can see what is going on, but the numerical Originator_ID is still important itself. This is because it is used in the conflict resolution if p2p_continue_onconflict is set to true – the Originator_ID with the highest value wins in a conflict.

Looking a bit deeper…

At this stage you might ask how this can possibly work. If the TranID changes when a row changes, the value in $sys_p2p_cd_id must change, so the value of $sys_p2p_cd_id on the changed node and on another (unchanged) node will naturally be different. If we look for differences in $sys_p2p_cd_id as an indication that there is a conflict, then every data change should produce such a conflict!

Actually, the log reader here does something quite clever. When the data change is first made, the old and the new values of the $sys_p2p_cd_id column are stored in the MSrepl_commands table. We can use sp_browsereplcmds to see that these values are added to the arguments of the internal stored procedure sp_MSupd_dbotRegion. This way the old value of $sys_p2p_cd_id can be checked against the value at the destination, because it should still be identical. If it isn't, there is a conflict, and we then go on to compare the 2 Originator_IDs – the highest value wins.
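To see this for yourself, sp_browsereplcmds can be run in the distribution database. With no filters it returns every pending command, so on anything other than a test system you'll want to narrow it down (for example by agent or by sequence number range):

-- Run in the distribution database; lists the pending replicated commands,
-- including the sp_MSupd_... calls carrying the old and new $sys_p2p_cd_id values
use distribution
exec sp_browsereplcmds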