Credentials and Proxies and why bother using them?

When we are setting up a subscriber there is a screen where we decide on the agent security.

The ellipsis (three dots) button above takes us to a detailed setup form. Most often, I think, DBAs select the highlighted option in the security form below.

Certainly the installations I’ve seen suggest that this is what has always been chosen. It was the only possible setup in SQL Server 2000 and, despite the warning that “this is not a recommended security best practice”, it is still definitely the easiest option.

In this scenario distrib.exe or replmerg.exe is run by the Windows account that SQL Server Agent is set to use. This account is impersonated to gain access to the distributor and the subscriber. Often the SQL Agent account is a sysadmin on the publisher and a local Windows administrator. Sometimes this exact same account is also a sysadmin and local admin on the subscriber, and in some extreme cases it is a domain admin. I’m not recommending any of these – just pointing out what seems commonplace. This is the crux of the issue, as Microsoft always recommend using a least-privilege account for services.

But why bother with this least-privilege philosophy for replication? Well, things can be run through the replication engine – nasty things. The most obvious example is sp_addscriptexec: you have to be db_owner to run it, but because the script it posts is executed at the subscribers under the agent’s security context, a db_owner can essentially escalate to sysadmin by this means. Other examples would be all those text files in the snapshot directory, or the DDL change scripts. What this really amounts to is the possibility of SQL injection via replication.
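For illustration only, here is roughly what a call to sp_addscriptexec looks like – it posts a script file to be executed at every subscriber by the replication agents, which is exactly why the agent’s security context matters (the publication name and file path here are made up):

-- Run at the Publisher on the publication database. The script file is picked up
-- and executed at each subscriber by the distribution or merge agent.
EXEC sp_addscriptexec
      @publication = N'ptesttran'
    , @scriptfile  = N'\\fileshare\scripts\post_to_subscribers.sql'
    , @skiperror   = 0;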

So – if you really want to lock it all down and ensure that the replication engine cannot do too much harm in the hands of a malicious user, we need to consider using credentials and proxies and get away from impersonating the SQL Agent account. A credential holds the details of a Windows user and its password, and that account should have reduced privileges compared to the SQL Agent service account. If the credential account and the SQL Agent account have the same level of access to SQL and to the OS then you are going down the wrong street!

A proxy is a reference to the credential to be used for a particular subsystem – “Replication Distributor” above. You might want the same credential to be used for both the merge and the distribution agents – in that case one credential and two proxies. Both credentials and proxies are easy to set up in SSMS (above); you just need to create the credential first and then the proxy.
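Scripted out, the setup is along these lines – a sketch only, with a placeholder account name and password:

-- 1. The credential stores the low-privilege Windows account and its password.
USE master;
CREATE CREDENTIAL ReplAgentCredential
    WITH IDENTITY = N'MYDOMAIN\repl_agent',    -- least-privilege account
         SECRET   = N'StrongPasswordHere';

-- 2. The proxy references the credential and is granted to the subsystem(s)
--    the agents run under, e.g. Distribution here (and Merge for merge agents).
EXEC msdb.dbo.sp_add_proxy
      @proxy_name      = N'DistributorProxy'
    , @credential_name = N'ReplAgentCredential'
    , @enabled         = 1;

EXEC msdb.dbo.sp_grant_proxy_to_subsystem
      @proxy_name     = N'DistributorProxy'
    , @subsystem_name = N'Distribution';

If the replication jobs are not owned by a sysadmin you will also need sp_grant_login_to_proxy so that the job owner is allowed to use the proxy.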

Once this is done, the job step that runs the particular subscription agent will have another entry in the “Run as” drop-down – the name of the proxy account that you added. In the case below I set the second job step of the distribution job to use the “DistributorProxy” proxy account (sorry about my naming convention!).
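If you prefer to script it rather than click through SSMS, the same assignment can be made with sp_update_jobstep – a sketch only, since the distribution job name is generated on your system:

-- Point the "Run agent." step of the distribution job at the proxy.
-- The job name below is an assumption; check msdb.dbo.sysjobs for the real one.
EXEC msdb.dbo.sp_update_jobstep
      @job_name   = N'MYSERVER-testtran-ptesttran-MYSUB-1'
    , @step_id    = 2
    , @proxy_name = N'DistributorProxy';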

You might wonder how this all relates to the screen below in the subscription wizard, which lets you specify an alternative Windows account. Seems similar?

Well, this does all the Proxy/Credential/Job configuration for you in the background! The only problem is that if you leave it to the wizard you’ll end up with a mess of duplicates and horrible proxy and credential names like the one below, so it’s definitely worth designing it yourself.

Distribution agents and another reason to not use continuous schedules

Well, strictly speaking this is a post to remind you not to accept the defaults when using continuous schedules, although the problem I’ll describe never arises when you use a repeating schedule.

So – first of all, are you running the distribution agent continuously? Remember back to when you set up the subscription – there was an option mentioning this…

Can’t remember? Just take a look at the job schedule of the Distribution Agent. The schedule description will tell you “Start automatically when SQL Server Agent starts”.

This means a continuous schedule. It is the default and therefore often the one selected; it also provides the lowest latency, so it seems like the obvious choice. However, what if you hit a data consistency error because someone has messed up the subscriber data? It happens! If a replication command cannot be applied at the subscriber the job runs, retries, retries, retries and so on. No error is reported and things just seem to hang. Perhaps it is a report user who tells us that the data seems stale, but being professionals we should know this before them!

The setting we need to be aware of is the one on the job step which runs the distribution agent. It is set to retry indefinitely by default – well, actually 2,147,483,647 times, but that amounts to indefinitely. If we don’t spot that there is a problem, pretty soon the subscription will expire and have to be reinitialized.

How does this look with a repeating schedule? Is it any different? Here it is:

Far more reasonable – after 10 minutes we get our error, receive a PagerDuty alert, connect up and sort out the problem. Why is the number of retries different here compared to that set for a continuous schedule? Absolutely no reason that I know of, and on every system I’ve worked on I reset them as soon as I realise people are using continuous schedules.

So – either reset the retry count to something reasonable (10 seems OK) or use a repeating schedule.
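As a sketch, the retry settings live on the job step and can be checked and changed through msdb – the job name below is an assumption; yours will be the generated distribution agent job name:

-- Check the current retry settings of the distribution agent job steps.
SELECT j.name, s.step_id, s.step_name, s.retry_attempts, s.retry_interval
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
WHERE j.category_id = (SELECT category_id
                       FROM msdb.dbo.syscategories
                       WHERE name = N'REPL-Distribution');

-- Reset the agent step to 10 retries, one minute apart.
EXEC msdb.dbo.sp_update_jobstep
      @job_name       = N'MYSERVER-testtran-ptesttran-MYSUB-1'
    , @step_id        = 2
    , @retry_attempts = 10
    , @retry_interval = 1;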

Solving data consistency errors and the need for monitoring….

Let’s consider what happens when transactional replication fails. Say there’s been a dodgy release and the application is now pointing at a replication subscriber, where data changes have been made. Realising the problem, the application is re-pointed at the publisher database, and when we start the distribution agent our problems begin to surface. We’ll see some info about retrying commands, and after 10 retries the distribution agent will fail (I’m assuming you have the distribution agent on a timed schedule – there’ll be another post to explain why this is a good idea :)). Failed inserts and updates are the usual issues we’ll see. This has happened to me in a production environment and there are a few ways of dealing with it:

(1) Reinitialize. The safest option, but not always possible in production until we can negotiate some downtime. Sometimes we just need to get things working – see the sketch below for kicking this off in T-SQL.
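A minimal sketch, run at the publisher on the publication database, reusing the names from the example further down:

-- Mark the subscription for reinitialization; the next snapshot and
-- distribution agent runs will rebuild the subscriber tables.
EXEC sp_reinitsubscription
      @publication    = N'ptesttran'
    , @article        = N'all'
    , @subscriber     = N'DESKTOP-NSONKMO\PAULSINST'
    , @destination_db = N'testtran';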

(2) Use sp_setsubscriptionxactseqno. We just need to get hold of the transaction number (xact_seqno) of the failing command and tell the distribution agent to skip it and move on to the next command. We could use the -Output parameter in the distribution agent job (see here); this used to be a common way of getting xact_seqno values, but the sp_helpsubscriptionerrors stored procedure is much easier:

EXEC sp_helpsubscriptionerrors
      @publisher     = 'DESKTOP-NSONKMO'
    , @publisher_db  = 'testtran'
    , @publication   = 'ptesttran'
    , @subscriber    = 'DESKTOP-NSONKMO\PAULSINST'
    , @subscriber_db = 'testtran'

The proc returns enough info to help us understand what has happened, e.g. “Violation of PRIMARY KEY constraint 'PK_tRegion'. Cannot insert duplicate key in object 'dbo.tRegion'. The duplicate key value is (3).”. The main thing we need, though, is the xact_seqno, which is also returned.

Then we run the following stored procedure at the Subscriber on the subscription database:

EXEC sp_setsubscriptionxactseqno
      @publisher    = 'DESKTOP-NSONKMO'
    , @publisher_db = 'testtran'
    , @publication  = 'ptesttran'
    , @xact_seqno   = 0x00000022000005FF000300000000

Note that this can be a pain if there are a lot of errors: each time you skip the failing command you then have to find the next failing xact_seqno and skip that one too, so you could be doing this for a while.

(3) We can set the distribution agent to use the “Continue on data consistency errors” profile, which skips these types of data consistency error – rows that are missing and rows that already exist.

In fact this is what the previous admin had set as a default profile on the system I was looking at. Most often it is the fastest way to get things up and running and it avoids the iterative approach of sp_setsubscriptionxactseqno.

The issue I have is that once this is set, no error is recorded by the distribution agent while it happily skips commands which cannot be applied.

We should of course run regular validations, but in the meantime, if you are using this profile, what you need is a separate job which alerts you when commands have been skipped so that you can take action. Just create a regular job which checks the distribution database for consistency errors and sends an email if any are found. The query for the check is below:

SELECT msre.id, msre.time, msre.error_text, msre.error_code
FROM MSrepl_errors msre
WHERE msre.id IN (SELECT error_id
                  FROM MSdistribution_history)
  AND msre.error_code IN (2601, 2627, 20598)
ORDER BY msre.time DESC
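As a sketch of the alerting step itself, something along these lines could form the body of the job step – it runs against the distribution database, and the Database Mail profile name and recipient address are assumptions:

-- Alert when the distribution agent has skipped data consistency errors.
IF EXISTS (SELECT 1
           FROM MSrepl_errors msre
           WHERE msre.id IN (SELECT error_id FROM MSdistribution_history)
             AND msre.error_code IN (2601, 2627, 20598)
             AND msre.time > DATEADD(HOUR, -1, GETDATE()))   -- only errors since the last run
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
          @profile_name = N'DBA Mail'
        , @recipients   = N'dba-team@example.com'
        , @subject      = N'Replication commands have been skipped'
        , @body         = N'Data consistency errors were skipped by the distribution agent - check MSrepl_errors.';
END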


Scheduling Validations

In merge replication we want to check that the data is in sync. This is done by setting up a validation:

Checking the rowcounts and the checksums is the only surefire way to get a comparison:

Nothing happens until we run the merge agent and synchronize. At that point if the data is out of sync the job fails and we’ll get an error in the merge agent history.

The best thing to do is to have a separate job which regularly triggers the validation. I have one which runs each Monday morning and has the following type of text in it:

EXEC sp_validatemergepublication @publication = 'pubTestMerge', @level = 3
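For reference, @level 1 is a rowcount-only check, 2 adds a checksum and 3 uses a binary checksum. If you only want to validate a single subscription rather than the whole publication, something like the following sketch works (the subscriber names are assumptions) – again, nothing is actually checked until the merge agent next runs:

-- Run at the Publisher on the publication database.
EXEC sp_validatemergesubscription
      @publication   = N'pubTestMerge'
    , @subscriber    = N'DESKTOP-NSONKMO\PAULSINST'
    , @subscriber_db = N'testmerge'
    , @level         = 3;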

As a final point, if the merge agent is run again afterwards there is a successful sync and no further mention of the invalid data.

So – perhaps the support team re-run the agent and the important info is lost. This happens quite often, because the merge agent can fail for innocuous reasons such as a transient network issue. Just something to be aware of.

Peer-to-Peer – the “hidden” column

Sometimes it’s useful to know how peer-to-peer really works under the bonnet – for troubleshooting, or just out of interest if you’re feeling geeky :). To see how rows are tracked we’ll need to make sure the dedicated admin connection (DAC) is enabled for remote connections:

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'remote admin connections', 1
RECONFIGURE

Connect to your SQL instance using the DAC, e.g. ADMIN:ServerName

Now you’ll have access to a special hidden column: $sys_p2p_cd_id. In the case of my table tRegion I run:

select *, $sys_p2p_cd_id as HiddenColumn from [TestP2Pdb]..tRegion

Lo and behold I can see some funky numbers in this column….

To find out what it means, and how this data is used, we run the following query:

SELECT poh.originator_node, res.*
FROM
(
    SELECT *,
           $sys_p2p_cd_id AS HiddenColumn,
           sys.fn_replvarbintoint($sys_p2p_cd_id) AS OriginatorID,
           sys.fn_replp2pversiontotranid($sys_p2p_cd_id) AS TranID
    FROM tRegion
) res
INNER JOIN dbo.MSpeer_originatorid_history poh
    ON res.OriginatorID = poh.originator_id

The query above reflects the fact that the hidden column $sys_p2p_cd_id actually holds 2 separate bits of info: the node which made the change and the id of the transaction itself.

The internal function sys.fn_replvarbintoint is particularly useful here as it returns the Originator_ID. In the query above I have joined the inner results to the MSpeer_originatorid_history table in order to get the actual peer hostname (originator_node) so you can see what is going on, but the numerical Originator_ID is important in itself. This is because it is used in conflict resolution if p2p_continue_onconflict is set to true – the Originator_ID with the highest value wins in a conflict.

Looking a bit deeper…

At this stage you might ask how this can possibly work. If the TranID changes whenever a row changes, the value in $sys_p2p_cd_id must change too, so the value of $sys_p2p_cd_id on the node that made the change will naturally differ from the value on a node that has not yet received it. If a difference in $sys_p2p_cd_id were taken as the indication of a conflict, every data change would look like one!

Actually, the log reader here does something quite clever. When the data change is first made, the old and the new values of the $sys_p2p_cd_id column are stored in the MSrepl_commands table. We can use sp_browsereplcmds to see that these values are added to the arguments of the internal stored procedure sp_MSupd_dbotRegion. This way the old value of $sys_p2p_cd_id can be checked against the value at the destination, because that should still be identical. If it isn’t, there is a conflict, and the two Originator_IDs are then compared – the highest value wins.
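If you want to see this for yourself, sp_browsereplcmds can be run at the distributor – on anything bigger than a test system you would want to narrow it down with its filter parameters:

-- Run at the Distributor on the distribution database (default name assumed).
-- The decoded command text shows the old and new $sys_p2p_cd_id values being
-- passed to procedures such as sp_MSupd_dbotRegion.
USE distribution;
EXEC sp_browsereplcmds;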

Peer-to-Peer – interesting insert-insert conflicts

Understanding conflict resolution is important in peer-to-peer if you intend to enable it. There is an argument that you should enable conflict detection but not the resolution: knowing that a conflict has happened allows you to investigate and then reinitialize one of the nodes after plugging the hole in the application logic. Even so, some people have conflict resolution enabled. It certainly helps with update-update conflicts (provided you assign the OriginatorIDs correctly – see the related post on this site). For insert-insert conflicts you’d expect a failure which can’t be resolved. That’s what used to happen.
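As a sketch, detection and resolution are controlled by the publication properties p2p_conflictdetection and p2p_continue_onconflict – the publication name here is made up, and changing these on a live topology needs care:

-- Detect conflicts but leave automatic resolution off, so a conflict stops
-- the agent rather than being silently resolved.
EXEC sp_changepublication
      @publication = N'pubP2P'
    , @property    = N'p2p_conflictdetection'
    , @value       = N'true';

EXEC sp_changepublication
      @publication = N'pubP2P'
    , @property    = N'p2p_continue_onconflict'
    , @value       = N'false';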

Consider a simple table tRegion. We have a PK on the RegionID column, and one node inserts “Rome” while the other inserts “Paris”. Both rows have a RegionID value of 2, so there will be a conflict when we synchronize.
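For reference, the two conflicting inserts amount to something like this – a sketch only, with RegionName as an assumed column name:

-- At the publisher (OriginatorID 100), before the nodes synchronize:
INSERT INTO dbo.tRegion (RegionID, RegionName) VALUES (2, N'Rome');

-- At the subscriber (OriginatorID 1):
INSERT INTO dbo.tRegion (RegionID, RegionName) VALUES (2, N'Paris');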

Publisher:

Subscriber:

Synchronization at each site looks pretty innocuous:

The nodes are defined as below – the publisher has an OriginatorID of 100 and the subscriber a value of 1 so the publisher will win any conflicts.

The conflict viewers show us an interesting story:

The crucial bit is that “The incoming insert was applied to peer 1”!!! In other words we now have data in sync and both regions are “Rome”. How is this possible?

Well, the magic is in the insert stored procedure – “sp_MSins_dbotRegion1335813640” in this case. Where there is an insert-insert conflict it converts the insert into an update on the peer with the lower OriginatorID :).

Note that this could have unintended consequences! Foreign key records continue to point at this PK record, so we could end up with, for example, all the Paris streets added at the subscriber now belonging to the Rome region. So – be very careful if you are going to use conflict resolution!

Replicating Indexes, Constraints, Triggers and so on

One thing we need to be sure of is that the subscriber and publisher “look” the same from the DBA point of view – are all aspects of the table the same at the subscriber?

We seem to be able to control this in the article properties. The defaults say that the FKs, Indexes, Triggers, Defaults etc go to the subscriber:

There is a related setting in the publication properties (Subscription Options) which says we want to replicate schema changes – by default set to true:

However, note that when we make a change on the publisher after initialization, not everything will go down to the subscriber. We can see what is queued up in sysmergeschemachange (there’s a sample query after the list below). For the main object types we need to remember:

  • Replicated to the Subscriber: Primary Keys, Foreign Keys, Check Constraints, Default Constraints
  • Not Replicated to the Subscriber: Indexes, Triggers
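A quick way to see what has been queued up – just a sketch, run at the publisher on the publication database:

-- Shows the schema changes that merge replication has captured and will
-- deliver to subscribers at the next synchronization.
SELECT *
FROM dbo.sysmergeschemachange
ORDER BY schemaversion DESC;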

So – be careful to manually keep the subscriber in sync when it comes to indexes and triggers!

Understanding Identity Ranges – when do they fill up?

For a table which is merge-replicated and has an identity column, how does it all work? Looking at the article properties we can see the range sizes allocated to the publisher and the subscriber. Each gets an allocated range with a different starting point to make sure they don’t overlap. Notice that the publisher range is a lot bigger than the subscriber’s, but we can adjust this before publishing the table.

So – now for a quick question: what happens if I stop the merge agent from synchronizing, insert a single row in a loop at the publisher, and do the same at the subscriber? Let’s say I try to insert 50,000 records this way at each node. Remember the declared range sizes above are 10,000 and 1,000.

Do you have an idea in mind? Will it work or fail? If it fails, at what point will it fail?

The answer, which may be surprising:

  • Publisher will insert all 50,000 records
  • Subscriber will insert 2,000 records and then fail with the following message:

Msg 548, Level 16, State 2, Line 5
The insert failed. It conflicted with an identity range check constraint in database ‘TestRepSub’, replicated table ‘dbo.tCities’, column ‘ID’. If the identity column is automatically managed by replication, update the range as follows: for the Publisher, execute sp_adjustpublisheridentityrange; for the Subscriber, run the Distribution Agent or the Merge Agent.
The statement has been terminated.

This is interesting. First of all we can see that the range on the subscriber is actually twice the size we declared (2 × 1,000 = 2,000) – this is always the case. Secondly, the publisher is able to assign itself a new range on the fly when needed, unlike the subscriber. For the subscriber we really need to synchronize to get the next range.

The logic for this is all in the merge triggers and check constraints for those interested.
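If you are curious, the current range limits are visible in the check constraint that replication maintains on the published table – a quick catalog query against the example table shows it:

-- Run in the publication or subscription database; lists the identity-range
-- check constraint (and any others) defined on the table.
SELECT cc.name, cc.definition
FROM sys.check_constraints AS cc
WHERE cc.parent_object_id = OBJECT_ID(N'dbo.tCities');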

One final point: the scenario above involves single-row inserts. You’ll have a big problem if your code uses the following sort of logic to insert a lot of rows:

INSERT INTO tCities ([City])
SELECT City FROM tCities

Such an insert will fail even on the publisher once it blows past the current range, and the logic needs reworking to reduce the batch size. A new range is only allocated when an insert lands on the upper limit of the declared identity range rather than jumping past it mid-statement, and in reality the only way to ensure this is to use single-row inserts – see the sketch below.
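A minimal sketch of reworking the doubling insert into single-row inserts, so the identity-range logic gets the chance to hand out a new range at each boundary (the column name and type are assumptions):

-- Snapshot the rows to copy, then insert them back one at a time.
DECLARE @src TABLE (rn int IDENTITY(1,1) PRIMARY KEY, City nvarchar(100));
INSERT INTO @src (City)
SELECT City FROM dbo.tCities;

DECLARE @i int = 1, @max int;
SELECT @max = MAX(rn) FROM @src;

WHILE @i <= @max
BEGIN
    INSERT INTO dbo.tCities ([City])
    SELECT City FROM @src WHERE rn = @i;    -- one row per statement
    SET @i += 1;
END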