In this article we'll look at conflicts and errors that can occur in a Peer-to-Peer (P2P) setup and how to solve them. I know that in BOL we're told that ideally conflicts should not occur in P2P, and to be honest I'd assumed this was the case out there “in the trenches”. The thinking is that partitioning the data changes should ensure that there are no conflicts, so a correct design will safeguard the system. However, I've recently been approached by two clients who don't have such a flawless design and who need to know more about conflicts (and identity columns, but that's another story :)) because they have issues in this area. The other part of the article deals with errors which may occur during schema changes when the system isn't fully quiesced - more on this later.
Consider the case where we have three P2P nodes – “Node3”, “Node6” and “Node10”.
The publications are all set to allow conflict detection and to continue after a conflict is detected.
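For reference, these options correspond to the p2p_conflictdetection and p2p_continue_onconflict properties of sp_changepublication. A minimal sketch of setting them - the publication name here is hypothetical:

-- Enable conflict detection and allow the distribution agent to continue after a conflict.
EXEC sp_changepublication @publication = N'P2P_Pub', @property = N'p2p_conflictdetection', @value = N'true'
EXEC sp_changepublication @publication = N'P2P_Pub', @property = N'p2p_continue_onconflict', @value = N'true'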
An update takes place on the same row on “Node3” and “Node6” simultaneously. On these two nodes an update transaction will be reported in the history of the distribution agent in the Replication Monitor. No errors will be reported, and there will be no mention of a conflict, although on examination you'll see that the data has converged. We can see details of the conflict in the conflict viewer, and polling the conflict tables (dbo.conflict_dbo_tablename) could be used as a means of alerting us that some important event has occurred.
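As a sketch of such an alert - assuming a hypothetical published table dbo.MyTable, so that the conflict table follows the conflict_dbo_tablename pattern above:

-- Raise a logged message if any conflict rows have been recorded.
IF EXISTS (SELECT 1 FROM dbo.conflict_dbo_MyTable) RAISERROR('P2P conflict rows found in dbo.conflict_dbo_MyTable', 10, 1) WITH LOG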
Note that this is different to the merge replication implementation of the conflict viewer – here we see historical info regarding the conflict, but we have no option to reverse the decision about the winner and loser – we can remove the conflict but nothing more. The conflict viewer is available on each node and will report similar, but not identical, information there. To understand the messages and why they differ (the text in the middle of the conflict viewer), we must take a step back and ask: if the data has converged, who decided which node would be the winner?
So, run the query below on any node and you can see the priority order. The highest originator_id is the most important and will win in a conflict against a node with a lower originator_id. Note that the initial value of the originator_id, which is decided when you create the setup, can't be altered afterwards! If you configure the first node with a low value but it turns out to be the main node in the setup, then you'll have to tear it all down and set it up again, so this needs to be considered right at the start!
SELECT originator_id, originator_node FROM dbo.MSpeer_originatorid_history
Similarly, the originator_id is visible graphically in the “Configure Peer-To-Peer Topology” wizard.
Back to the messages then. They will look similar to the ones below:
on Node 6
An update-update conflict between peer 3 (incoming) and peer 6 (on disk) was detected and resolved. The incoming update was skipped by peer 6.
on Node 3
An update-update conflict between peer 6 (incoming) and peer 3 (on disk) was detected and resolved. The incoming update was applied to peer 3.
Note the key words in the messages - the node with the lower originator_id will report that the incoming update has been “applied”, while the node with the higher originator_id will report that it was “skipped”. The “peer 3” and “peer 6” mentioned in the text are references to the originator_id, and we can use the MSpeer_originatorid_history table to determine which physical host each one really refers to.
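For example, a quick lookup for the two peers in our scenario:

SELECT originator_id, originator_node FROM dbo.MSpeer_originatorid_history WHERE originator_id IN (3, 6)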
Stopping the distribution agent when a conflict occurs can be configured using the publication properties, or by setting the p2p_continue_onconflict property of sp_changepublication to FALSE. This is actually the recommended setting, for reasons I'll cover later on.
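A sketch of making that change - again the publication name is hypothetical:

-- Make the distribution agent fail on conflict rather than converging silently.
EXEC sp_changepublication @publication = N'P2P_Pub', @property = N'p2p_continue_onconflict', @value = N'false'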
In the case of our update-update conflict, the distribution agent now fails and as administrators we therefore have much higher visibility of these issues.
The error message will appear like the one below:
if @@trancount > 0 rollback tran
(Transaction sequence number: 0x00000033000001C1000500000000, Command ID: 1)
A conflict of type 'Update-Update' was detected at peer 10 between peer 3 (incoming), transaction id 0x00000000000031ed and peer 6 (on disk), transaction id 0x0000000000003164 (Source: MSSQLServer, Error number: 22815)
If we want to get more info regarding this transaction, we can use the trusty sp_browsereplcmds – as usual, we strip off the trailing eight zeros from the transaction sequence number and run the following:
EXEC distribution..sp_browsereplcmds @xact_seqno_start = '0x00000033000001C10005', @xact_seqno_end = '0x00000033000001C10005'
In the returned recordset, the command field shows details of the failing update. So, we can see that there is a problem and what the relevant command is, and we now need to get the system up and running again. As it is an update-update conflict, we might make a judgement call, reset the p2p_continue_onconflict value to TRUE and allow convergence to take place. If we are lucky, this'll fix the issue. Afterwards, run a validation test to check that the data has converged and all is well, and examine any other conflicts and the agent history. You may see warnings in the Replication Monitor simply because it's been a while since the distribution agent last ran successfully - inserting a tracer token in this case will remove the warning.
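Those last steps might look something like this - the publication name is hypothetical, and both procedures are run at the publisher on the publication database:

-- Validate the published tables (row counts by default).
EXEC sp_publication_validation @publication = N'P2P_Pub'

-- Post a tracer token to clear the latency warning in Replication Monitor.
EXEC sys.sp_posttracertoken @publication = N'P2P_Pub'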
If after this process the data is converged but it is not correct (e.g. because the originator_ids are not set in the correct priority order to propagate the "correct" update statement in preference to the other one), then we could still try to fix it using the fact that the highest originator_id always wins: we make the required changes on this node, which then propagates the correct data to the other nodes. If neither of these methods works (this can be the case in more complex setups because of triggers and FK constraints), we'd have to take a backup and reinitialize the nodes :(.
Consider the case where there is an update on Node6 and a delete of the same row on Node3. As the update is performed on the node with the higher originator_id, we have an issue: it should win the conflict, but it can't be applied to Node3 because that node has already removed the required record! The p2p_continue_onconflict setting won't help us here as the conflict can't be resolved. If we have p2p_continue_onconflict set to TRUE, we'll see a message like the following:
“An update-delete conflict was detected and unresolved. The row could not be updated since the row does not exist. The incoming update was skipped. Check the priority of the destination peer and run data validation to ensure the delete conflict did not result in data non-convergence.”.
In fact, this is exactly the problem – we have arrived at the dreaded non-convergence situation. So, what can we do? Well, we can't just insert the row back on Node3 because then we'd get a PK conflict on Node6, which already has the row. Note however that the error message tells us the update was "skipped", which means it won't be retried. This means that we can fix the issue by deleting the record off Node6 and then reinserting it back on Node6, which will propagate without any errors or conflicts. If you have a more complex setup with FK records relating to this record, then removing it could become difficult and once again you might end up having to reinitialize using a backup.
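As a sketch - assuming a hypothetical table dbo.MyTable with primary key column Id and a conflicting row with Id = 42, both statements run on Node6, and the column names are illustrative:

-- Delete the row from Node6; the delete then propagates to the other nodes.
DELETE FROM dbo.MyTable WHERE Id = 42

-- Reinsert the desired version of the row on Node6 so it propagates back out.
INSERT INTO dbo.MyTable (Id, SomeColumn) VALUES (42, N'corrected value')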
We can see in BOL that if the publication is in a P2P topology, the system must be quiesced before making schema changes. The recommendation is to run sp_requestpeerresponse and then sp_helppeerresponses to determine that the system is quiesced, which is pretty much like using tracer tokens throughout the setup. However, why is this needed and what happens if the system is not quiesced? We might think that the system is quiesced but a pesky user accesses the system while we are making the schema changes - what then? Normally nothing bad will happen, but there is an edge case where there can be issues: consider dropping a column on one node while someone updates that same column on another node. Now we have a problem.
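Before we look at that problem, the quiesce check mentioned above might be sketched as follows - the publication name is hypothetical:

-- Ask every peer to respond once all queued changes have reached it.
DECLARE @request_id int
EXEC sp_requestpeerresponse @publication = N'P2P_Pub', @description = N'Quiesce check before schema change', @request_id = @request_id OUTPUT

-- Allow time for propagation, then confirm that every peer has responded.
EXEC sp_helppeerresponses @request_id

Now, back to our dropped column scenario.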
The node where the update is attempted will report the error:
Invalid column name 'ColumnName'. (Source: MSSQLServer, Error number: 207)
This is not related to conflicts and can be treated as a normal distribution agent error. We can “fix” it and allow the distribution agent to continue by taking the error number (207 in this case) and adding it to the “SkipErrors” part of the distribution agent profile. Make sure that the agent is stopped and restarted to pick up the new profile for this to take effect. The agent will then skip the failing update statement and allow the schema change to complete as normal.
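If you'd rather script this than use the GUI, a hedged sketch is below - it is run at the distributor, the profile name is illustrative, and the parameter name follows the convention of the built-in profiles with its leading dash:

-- Create a custom distribution agent profile (agent_type 3 = distribution agent).
DECLARE @profile_id int
EXEC sp_add_agent_profile @profile_id = @profile_id OUTPUT, @profile_name = N'Skip error 207', @agent_type = 3, @description = N'Skips invalid column name errors during a P2P schema change'

-- Add error 207 to the profile's SkipErrors list.
EXEC sp_add_agent_parameter @profile_id = @profile_id, @parameter_name = N'-SkipErrors', @parameter_value = N'207'

Assign the new profile to the relevant distribution agent and then stop and restart the agent as described above.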
At the moment the recommendation is to avoid conflicts through an appropriate application design, but should that not be possible, or not be done correctly, then hopefully this article helps clarify the issues surrounding conflicts in P2P replication and how we might try to solve any problems that arise. Getting to grips with this technology is probably going to become more and more important as it increasingly overlaps with the merge replication arena.
Paul Ibison, Copyright © 2013