Average 28000 requests/month












 

Merge Precomputed Partitions Bug in SQL Server 2005

"The case of the missing customer orders"

Article by Nigel Maneffa

Customer Joe Bloggs starts trading in the USA. The USA division creates a customer record and then mark his customer record that he is owned by the USA division. All of Joe Bloggs’ s data resides only on the USA server. The diagram below shows this in data terms. The merge replication horizontal/join filters achieve this.

 
Joe Bloggs then opens a store in the UK as well as the USA so the USA division change his customer record so that Joe Bloggs is joint owned by both the UK and the USA division – they should both be able to see his orders after this change BUT the UK only receives the customer record for Joe Bloggs, none of his orders are transferred onto the UK server. The merge process has copied the customer record to the UK, but none of his orders!

This is demonstrated below.

The UK division are up in arms as it appears to them that this customer has never bought a single item from the USA division despite Joe Bloggs rapid expansion. After much confusion and data reviews the DBA gets the blame! A lot of other data has also become out of sync as well due to the same problem.

The DBA runs merge dummy updates against the errant records but that does not work either. The DBA deletes/reinserts the records with the correct partitions set at the publisher, and watches for partition changes like a hawk, manually correcting the data.

The only workaround is to switch off precomputed partitions – change all the publications and rerun the snapshots.

Due to the performance hit and operational issues caused by running / applying snapshots the DBA has to run snapshots in the night and loses a night’s sleep to make sure they work properly.

The large performance benefits gained by using precomputed partitions are lost and the DBA gets another roasting from the Boss for slow sync times as well as the missing data.

Finally the Boss comments ‘I just don’t understand why you didn’t wait for a couple of service packs to let this new stuff settle down’. The tired DBA just smiles...

Technical Detail

Here are my findings: since I don't know what causes the problem (eg type, complexity of filtering) I do not know whether you will be affected or not. I will post up when MS have found the root cause.

ISSUE

Non-convergence on merge publication if the same articles are shared between two (or more) publications and the articles have parent-child relationship through join filters with precomputed partitions set (the default for most) partitioned publications. The publications in this repro have static horizontal filters. If you update one row in the parent table at one of the subscribers which qualifies for the partition of the second publication/subscription then the parent row gets downloaded but the child row(s) does not get downloaded on the second subscriber.

Dynamic filters were not tested.

Versions affected by bug

SQL2000 SP4 - OK (uses setupbelongs)
SQL2005 versions
RTM new publication  - OK
RTM upgrade to SP1 - OK
SP1 new publication - OK
SP1 upgrade to SP2 - OK (surprising)
SP2 new publication = BUG
SQL2000 directly upgraded to SQL2005 SP2 = BUG (ie snapshots not run after SP2 installed)

WORKAROUND

Warning – you will need to create a new publication for this to work with precomputed partitions switched off at time of creation (below), at least that is what my testing points to – you will need to verify in your own test environment what works for you

If you want to know when precomputed partitions are chosen by default during publication creation take a look here

If you want a full reproduction, I'll post one up soon....

 
 
 

Addendum, 17th July 2007: Fix for SQL2005 merge precomputed bug

This bug is fixed in build 3177 (KB 938363, not available when this was originally published). Note that you must run (the new for SP2) sp_vupgrade_mergeobjects to realise the fix. This was a regression bug arising from a fix to another bug relating to false conflicts that had been posted in the forums. It’s good to know that bug reports in the forums are taken seriously, but less so when the fix itself creates non convergence! Also worrying is the fact that this partition change scenario was not already in the product testing suite. I would have thought this would be a basic test. As I understand it when a bug is fixed the product team have to add the bug scenario to the testing suite to avoid having the same issue again.

 

One inherent problem highlighted by this bug is that it only surfaced when a new publication was created under SP2, upgrades from SP1 to SP2 would carry on unaffected.

What this could mean in practise: you can upgrade from SP1 to SP2 and everything would run perfectly. You could then script your publication, delete and recreate the publication, and it would then exhibit this bug, even though you might (very reasonably) expect that publications recreated in the same service pack would exhibit the same behaviour, bug or otherwise. This is a very undesirable behaviour, but it could be mitigated by using the new stored procedure sp_vupgrade_mergeobjects   http://technet.microsoft.com/en-us/library/bb326615.aspx    which is a very useful tool. I strongly recommend you make a mental note of this stored procedure, which dynamically recreates triggers, sp’s, views etc using the latest code base installed without new snapshots, reinit etc, so it is very useful for hotfixes like this, or when a replication object has been deleted in error.

(Technical Editor Paul Ibison)

 

May 2007