Well strictly speaking this is a post to remind you to not accept the defaults when using continuous schedules. Although the problem I’ll describe never arises when you use a repeating schedule.
So – first of all are you running the distribution agent continuously? Remember back to when you set up the subscription there was an option mentioning this…
Can’t remember? Just take a look at the job schedule of the Distribution Agent. You’ll see the description will tell you “Start automatically when SQL Server Agent starts”.
This means a continuous schedule. It is the default and therefore is often the one selected. Also it provides the lowest latency so it seems like the obvious choice. However what if you ever hit a data consistency error where someone has messed up the subscriber data. It happens! If the replication command cannot be applied to the subscriber the job runs, retries, retries, retries and so on. No error is reported. Things just seem to hang. Perhaps it is the report user who tells us that the data seems stale, but being professionals we should know this before them!
The setting we need to be aware of is the one on the job step which runs the distribution agent. It is set to retry indefinitely by default. Well actually 2147483647 times but this amounts to indefinitely. Pretty soon the subscriber will default and have to be reinitialized if we don’t spot that there is a problem.
How does this look with a repeating schedule? Is it any different? Here it is:
Far more reasonable and after 10 mins we have our error, receive a pagerduty alert and then connect up and sort out the problem. Why is the number of retries different here compared to that set in a continuous schedule? Absolutely no reason I know of, and every system I’ve worked on I reset them as soon as I realise people are using continuous.
So – either reset to something reasonable (10 seems ok) or use a repeating schedule.