Solving data consistency errors and the need for monitoring

Let’s consider what happens when transactional replication fails. Say there’s been a dodgy release and the application has been pointing at a replication subscriber, so data changes have been made there. Once the problem is spotted, the application is re-pointed at the publisher database, and when we start the distribution agent our problems begin to surface. We’ll see some messages about retrying commands, and after 10 retries the distribution agent will fail. (I’m assuming you have the distribution agent on a timed schedule; there’ll be another post to explain why this is a good idea :).) Failed inserts and updates are the usual issues we’ll see. This has happened to me in a production environment, and there are a few ways of dealing with it:

(1) Reinitialize. This is the safest option, but it is not always possible in production until we can negotiate some downtime. Sometimes we just need to get things working.
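For reference, reinitializing from T-SQL might look roughly like the sketch below. It assumes a push subscription and reuses the sample publication, subscriber and database names from the examples further down this post. Treat it as an outline to adapt, not a tested script.

```sql
-- Sketch only: run at the Publisher, in the publication database.
-- Names are the sample values used elsewhere in this post.
USE testtran;
GO

-- Mark the subscription for reinitialization (all articles).
EXEC sp_reinitsubscription
    @publication    = N'ptesttran',
    @article        = N'all',
    @subscriber     = N'DESKTOP-NSONKMO\PAULSINST',
    @destination_db = N'testtran';
GO

-- Start the Snapshot Agent job so a fresh snapshot is available to apply.
EXEC sp_startpublication_snapshot @publication = N'ptesttran';
GO
```

The subscriber tables are rebuilt from the new snapshot the next time the distribution agent runs, which is why this usually needs an agreed window.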

(2) Use sp_setsubscriptionxactseqno. We need the transaction sequence number (xact_seqno) of the most recent failing command, so that we can tell the distribution agent to skip it and move on to the next command. We could get it by adding the -Output parameter to the distribution agent job (see here). This used to be a common way of getting xact_seqno values, but the sp_helpsubscriptionerrors stored procedure is much easier:

-- Run at the Distributor, in the distribution database
EXEC sp_helpsubscriptionerrors
    @publisher     = 'DESKTOP-NSONKMO',
    @publisher_db  = 'testtran',
    @publication   = 'ptesttran',
    @subscriber    = 'DESKTOP-NSONKMO\PAULSINST',
    @subscriber_db = 'testtran'

The proc returns enough info to help us understand what has happened, e.g. “Violation of PRIMARY KEY constraint ‘PK_tRegion’. Cannot insert duplicate key in object ‘dbo.tRegion’. The duplicate key value is (3).” The main thing we need, though, is the xact_seqno, which is also returned.

Then we run the following stored procedure at the Subscriber on the subscription database:

EXEC sp_setsubscriptionxactseqno
    @publisher    = 'DESKTOP-NSONKMO',
    @publisher_db = 'testtran',
    @publication  = 'ptesttran',
    @xact_seqno   = 0x00000022000005FF000300000000

Note that this can be a pain if there are a lot of errors. Each time, you skip the last command that was tried, restart the agent, find the next failing xact_seqno and skip that one too. If there are a lot of errors you could be doing this for a while.

(3) Set the distribution agent profile to skip all these types of data consistency errors: missing rows and rows that already exist.

In fact this is what the previous admin had set as a default profile on the system I was looking at. Most often it is the fastest way to get things up and running and it avoids the iterative approach of sp_setsubscriptionxactseqno.
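To see which profiles exist and whether one of them skips errors, something like the following can be run at the Distributor. This is a sketch: the profile id is a placeholder, and the thing to look for is the -SkipErrors parameter, which the built-in "Continue on data consistency errors" profile sets to 2601:2627:20598.

```sql
-- Sketch only: run at the Distributor.
-- List the Distribution Agent profiles (agent_type 3 = Distribution Agent).
EXEC sp_help_agent_profile @agent_type = 3;

-- Show the parameters of one profile and look for -SkipErrors.
-- The id below is a placeholder: use a profile_id returned by the call above.
DECLARE @profile_id int = 4;
EXEC sp_help_agent_parameter @profile_id = @profile_id;
```

The same -SkipErrors value can also be added to the command line of an individual distribution agent job step, which limits the behaviour to one subscription instead of making it the default.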

The issue I have is that once this is set, no error is recorded against the distribution agent while it happily skips commands that don’t work.

We should of course run regular validations, but in the meantime, if you are using this profile, what you need is a separate job which alerts you when commands have been skipped, so that you can take action. Just create a scheduled job which checks for consistency errors and sends an email if any are found. The syntax for the check is below:

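-- Run in the distribution database.
-- error_code is a character column, so compare it against strings.
SELECT msre.id, msre.time, msre.error_text, msre.error_code
FROM MSrepl_errors AS msre
WHERE msre.id IN (SELECT error_id
                  FROM MSdistribution_history)
  AND msre.error_code IN ('2601', '2627', '20598')
ORDER BY msre.time DESC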
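Wrapped up as a job step, the check and the email could look something like this. It is a minimal sketch which assumes Database Mail is already configured and the job runs hourly. The mail profile name, the recipient address and the one-hour window are all made up for illustration.

```sql
-- Sketch only: body of a SQL Server Agent job step, run in the distribution database.
-- Assumes Database Mail is configured; the profile name and recipient are hypothetical.
DECLARE @since datetime = DATEADD(HOUR, -1, GETDATE());  -- assumed hourly schedule

IF EXISTS (
    SELECT 1
    FROM MSrepl_errors AS msre
    WHERE msre.id IN (SELECT error_id FROM MSdistribution_history)
      AND msre.error_code IN ('2601', '2627', '20598')
      AND msre.time >= @since
)
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DBA Mail',            -- hypothetical Database Mail profile
        @recipients   = N'dba@example.com',     -- hypothetical recipient
        @subject      = N'Replication: data consistency errors were skipped',
        @body         = N'Commands were skipped by the distribution agent. Check MSrepl_errors in the distribution database.';
END
```

Filtering on time stops the job re-sending an alert for the same old errors on every run. Adjust the window to match whatever schedule the job is actually on.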

 
