One-way database synchronization

It is often necessary to synchronize data from the master tables in one database to clone tables in other databases, often on other servers. For example, consider the case where a backend system manages the inventory data, and that inventory data must ultimately be pushed out to one or more databases that are part of a website application.

The source data in the backend system is highly normalized, with dozens of tables and foreign key constraints. It is a well-established OLTP RDBMS, and many of the tables in question contain millions of rows. The data must be distributed to the other databases regularly -- as frequently as possible, though some latency is acceptable. Above all, maximum uptime of both the backend and the remote databases is extremely important.

I am using SQL Server and am familiar with change tracking, rowversion, triggers, and so on. I know that Microsoft heavily promotes replication, Sync Framework, and SSIS for these scenarios. However, there is a big difference between vendor documents and presentations recommending a technology and actually implementing, deploying, and maintaining the solution. In the SQL Server world, replication is often presented as the turnkey answer, but I want to explore alternatives. (There is some concern that replication is difficult to administer, makes schema changes painful, and that if reinitialization is ever required it will mean significant downtime for critical systems.)
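
For concreteness, here is a minimal sketch of the change-tracking approach mentioned above. The database name, the dbo.Inventory table, and its columns are made up for illustration; a real pipeline (SSIS, Sync Framework, or custom code) would wrap something like this per table and persist the last synced version somewhere durable.

```sql
-- One-time setup: enable change tracking at the database and table level.
ALTER DATABASE InventoryDB
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Inventory
    ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- Each sync cycle: pull everything changed since the last synced version.
DECLARE @last_sync_version bigint = 0;  -- persisted by the consumer between runs
DECLARE @current_version   bigint = CHANGE_TRACKING_CURRENT_VERSION();

SELECT  ct.SYS_CHANGE_OPERATION,        -- I, U, or D
        ct.ItemId,                      -- primary key of the changed row
        inv.Name, inv.Quantity          -- current values (NULL for deletes)
FROM    CHANGETABLE(CHANGES dbo.Inventory, @last_sync_version) AS ct
LEFT JOIN dbo.Inventory AS inv
        ON inv.ItemId = ct.ItemId;

-- After applying the changes on the target, store @current_version as the
-- new @last_sync_version for the next run.
```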

There are plenty of pitfalls. Because of the complex foreign key relationships between a large number of tables, determining the order in which to capture or apply updates is not trivial. Because of unique indexes, two rows can be interlocked in such a way that applying updates row by row will not work (an intermediate update is needed for one of the rows before the final values can be applied). These are not necessarily show-stoppers, since unique indexes can often be changed to regular indexes and foreign keys can be disabled (although disabling foreign keys is highly undesirable). You will often hear "just use SQL 2008 change tracking and SSIS or Sync Framework." Such answers really do not do justice to the practical difficulties. (And of course, clients have a hard time wrapping their heads around how copying data can be so complicated, which only makes a difficult situation worse!)
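
To make the unique-index interlock concrete, here is a small made-up example (the table and values are illustrative, not from the question): two rows swap values in a uniquely indexed column, so applying the captured changes row by row fails unless one row is parked on a temporary value first.

```sql
-- Hypothetical target table with a unique constraint on Sku.
CREATE TABLE dbo.Item (
    ItemId int PRIMARY KEY,
    Sku    varchar(20) NOT NULL UNIQUE
);
INSERT INTO dbo.Item VALUES (1, 'A-100'), (2, 'B-200');

-- The source system swapped the SKUs: row 1 is now 'B-200', row 2 is 'A-100'.
-- Applying that one row at a time fails on the very first statement:
UPDATE dbo.Item SET Sku = 'B-200' WHERE ItemId = 1;  -- unique key violation

-- Workaround: park one row on an intermediate value, then finish the swap.
UPDATE dbo.Item SET Sku = '~TMP~' WHERE ItemId = 1;
UPDATE dbo.Item SET Sku = 'A-100' WHERE ItemId = 2;
UPDATE dbo.Item SET Sku = 'B-200' WHERE ItemId = 1;
```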

Ultimately this problem is very generic: perform one-way synchronization of many heavily used database tables with lots of rows. Almost everyone who works with databases runs into this kind of problem. Vendor documents are everywhere; practical experience is hard to find. We know this can be a difficult problem, but the work has to get done. Let me hear about what has worked for you (and what to avoid). Tell us about your experiences with Microsoft products or products from other vendors. But if you have not personally battle-tested your solution against a large number of closely related tables and rows, please refrain from answering. Let's keep it practical -- not theoretical.

sql database sql-server data-synchronization
1 answer

This is probably better asked on serverfault.com. (I can't post comments -- scripts are broken for me on SO -- so I have to post a full answer.)

Update: (switched to Safari, scripts work again, so I can post this properly)

There is no silver bullet. For ease of use and "turn-key" deployment, nothing beats replication. It is the only solution that covers in-depth conflict detection and resolution, supports pushing out schema changes, and offers a complete toolset for configuring and monitoring it. It was the flagship of MS data synchronization for many years before this "agenda" was taken over by the .Net crowd. In my opinion, replication has two major problems:

  • The technology used to push changes is primitive, slow, and unreliable. It requires file shares to initialize replicas, and it relies on T-SQL for the actual data transfer, which leads to all sorts of scalability problems: the replication threads consume server worker threads, and the fact that they operate on arbitrary tables alongside application queries leads to blocking and deadlocks. The largest deployments I have heard of are around 400-500 sites, and they were achieved by superhuman MVPs and top-dollar consultants. That stops short many projects that start out at 1,500 sites (well beyond the largest replication deployments out there). I am curious to hear whether I am wrong and you know of a SQL Server replication solution deployed to more than 500 sites.
  • The replication metaphor is too data-centric. It does not take into account the requirements of distributed applications: the need for versioned, formalized contracts, the autonomy of data fiefdoms, the decoupling of availability and security concerns. As a result, a replication-based solution addresses the immediate need to "make the data available over there," but it cannot solve the real problem of "my app needs to talk to your app."

At the other end of the spectrum you will find solutions that truly address the application-communication problem, such as message-queue-based services. But they are either painfully slow or riddled with problems caused by separating the communication mechanism (web services and/or MSMQ) from the data storage (DTC transactions between the comm layer and the db, no common high-availability story, no common recovery story, etc., etc.). Solutions that are blazingly fast and fully integrated with the database do exist on the MS stack, but nobody knows how to use them. Somewhere in between you will find various intermediate approaches, such as OCS/the Sync Framework and custom solutions built on SSIS. None will offer the ease of configuration and monitoring of replication, but they can scale and perform better.
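
The answer does not name the "blazingly fast, database-integrated" option, but SQL Server Service Broker is one technology on the MS stack that fits that description. A minimal, illustrative sketch of the setup and send side might look like the following (object names are invented; the routes, endpoints, and security needed for cross-server delivery are omitted):

```sql
-- Assumes Service Broker is enabled on both databases; routes, endpoints, and
-- security for cross-server delivery are omitted for brevity.
CREATE MESSAGE TYPE [//Inventory/ItemPurchased] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//Inventory/SyncContract]
    ([//Inventory/ItemPurchased] SENT BY INITIATOR);

CREATE QUEUE dbo.InventorySendQueue;
CREATE QUEUE dbo.InventoryTargetQueue;
CREATE SERVICE [//Inventory/Source] ON QUEUE dbo.InventorySendQueue;
CREATE SERVICE [//Inventory/Target] ON QUEUE dbo.InventoryTargetQueue
    ([//Inventory/SyncContract]);

-- Sender side: enqueue a business-level message inside the same transaction
-- that modified the source data, so the change and its notification commit
-- (or roll back) together.
DECLARE @h uniqueidentifier;

BEGIN DIALOG CONVERSATION @h
    FROM SERVICE [//Inventory/Source]
    TO SERVICE   '//Inventory/Target'
    ON CONTRACT  [//Inventory/SyncContract]
    WITH ENCRYPTION = OFF;

SEND ON CONVERSATION @h
    MESSAGE TYPE [//Inventory/ItemPurchased]
    (N'<ItemPurchased ItemId="42" CustomerId="7" Quantity="1" />');
```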

I have been involved in several projects that required "data synchronization" at a very large scale (1,200+ sites, 1,600+ sites), and my solution was to turn the problem into an "application communication" problem. Once the mindset shifts to this and the data flow is no longer seen as "the record with key X in table Y" but instead as "a message describing the purchase of item X by customer Y," the solution becomes much easier to understand and apply. You no longer think in terms of "insert the records in XYZ order so the FK relationships don't break," but in terms of "apply the purchase process described in message XYZ," as sketched below.
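
Continuing the same illustrative sketch (all object names are invented, and error and end-of-conversation handling is omitted), the target side consumes one business-level message and applies the whole operation as a unit, which is what thinking in terms of "the purchase described in the message" rather than row images amounts to:

```sql
-- Target side: pick up one business message and apply it as a single
-- transaction; FK ordering is handled here, once, by the processing logic,
-- instead of being inferred from captured row changes.
BEGIN TRANSACTION;

DECLARE @h    uniqueidentifier,
        @type sysname,
        @body xml;

WAITFOR (
    RECEIVE TOP (1)
        @h    = conversation_handle,
        @type = message_type_name,
        @body = CAST(message_body AS xml)
    FROM dbo.InventoryTargetQueue
), TIMEOUT 5000;

IF @type = N'//Inventory/ItemPurchased'
BEGIN
    DECLARE @itemId     int = @body.value('(/ItemPurchased/@ItemId)[1]',     'int'),
            @customerId int = @body.value('(/ItemPurchased/@CustomerId)[1]', 'int'),
            @qty        int = @body.value('(/ItemPurchased/@Quantity)[1]',   'int');

    -- dbo.Inventory and dbo.Purchase are hypothetical local tables.
    UPDATE dbo.Inventory SET Quantity = Quantity - @qty WHERE ItemId = @itemId;
    INSERT INTO dbo.Purchase (ItemId, CustomerId, Quantity)
        VALUES (@itemId, @customerId, @qty);
END

COMMIT TRANSACTION;
```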

In my opinion, replication and its derivatives (i.e., tracking data changes and shipping data-grams) are solutions rooted in '80s technology and an '80s view of data and applications. Outdated dinosaurs (and not the kind that evolve into birds).

I know this does not even begin to address all of your (very legitimate) concerns, but writing out everything I have to say/rant/ramble on this topic would fill volumes of paperback...

