How to configure a new SQL Server database for future replication?

Question

How to configure a new SQL Server database for future replication?

I am creating a system that may require support for 500+ concurrent users, each of which performs dozens of requests (selects, inserts AND updates) every minute. Based on these requirements and tables with many millions of rows, I suspect that in the future you will need to use database replication to reduce the load on the query.

Without using replication in the past, I wonder if there is something I need to consider in circuit design?

For example, I was once told that to enable replication, you must use the GUID for primary keys. It's true? What specific considerations or recommendations for database design exist for the database to be replicated?

Due to time constraints for the project, I don’t want to waste time replicating when it might not be needed. (I have enough specific problems to overcome at the moment, without worrying about resolving the possible ones.) However, I do not want to be able to avoid possible schema changes in the future when / if replication is required in the future.

Any other recommendations on this subject will also be appreciated, including good places to study replication implementation.

+6

sql-server database-design replication

Matt lacey Feb 08 '10 at 20:35

source share

3 answers

You can use a GUID for primary keys — in replicated system strings, they must be unique throughout your topology, and PK GUIDs are one way to achieve this.

Here is a short article on using GUID in SQL Server

+1

flesh Feb 08 '10 at 20:37

source share

I would say that your real question is not how to handle replication, but how to handle scaling, or at least scale for queries. And although there are various answers to this riddle, one answer will stand out: not using replication.

The problem with replication, especially with merge replication, is that records get multiplication in replication. Let's say you have a system that processes a load of 100 requests (90 reads and 10 records) per second. You want to scale, and you choose replication. Now you have 2 systems, each of which processes 50 requests, 45 views and 5 records. Now these records should be replicated, so the actual number of records is not 5 + 5, but 5 + 5 (original record), and then another 5 + 5 (replica writes), so you have 90 readings and 20 records. Thus, while the load on each system was reduced, the ratio of records to reads increased. This not only changes the I / O patterns, but, most importantly, it changes the load matching pattern. Add a third system and you will have 90 readings and 30 entries, etc. Etc. Soon you will have more records than reading, and the latency of replication updates combined with privacy issues and merge conflicts will cause your project to crash. The bottom line is that "soon" is much earlier than you expect. Most likely, to justify a search on a scale instead, since in any case you are talking about a scale of 6-8 peers, and a 6-8 times increase in capacity using scaling will be faster, much easier, and maybe even cheaper, start with.

And keep in mind that these are all purely theoretical numbers. In practice, the fact that the replication infrastructure is not free, it adds its own load on the system. It is necessary to track records, changes must be read, the distributor must exist to store the changes until they are distributed to subscribers, then the changes must be recorded and mediated for possible conflicts. That's why I saw very few deployments that could claim success with a replication-based scaling strategy.

One option is to scale read-only, and here replication does work, usually using transactional replication, but also performs log processing or mirroring using a database snapshot.

The real alternative is a partition (i.e., addition). Requests are routed in the application to the appropriate section and placed on the server with the corresponding data. Changes on one part, which should be reflected on another section, are sent via asynchronous (usually based on messaging) means. Data can only be combined into a section. For a more detailed discussion of what I'm saying, read how MySpace does it . Needless to say, such a strategy has a big impact on the design of the application and cannot be simply glued after v1.

+1

Remus Rusanu Feb 08 '10 at 22:10

source share

Adam robinson · Accepted Answer · 2010-02-08T20:44:37+0000

While each row should have a rowguid column, you do not need to use Guid for your primary key. In fact, you don’t even need to have a primary key (although you will be stoned to death for not being able to create it). Even if you define your primary key as a guideline and not in a rowguid column, as a result, Replication Services creates an additional column for you. You can definitely do this, and this is a good idea, but it is by no means necessary and not particularly beneficial.

Here are some suggestions:

Keep tables (or rather rows) small; if you do not use column-level replication, you will load / load the entire contents of the row, even if only one column is changed. In addition, smaller tables simplify and solve the conflict resolution problem.
Do not use sequential or deterministic primary keys controlled by an algorithm. This includes identity columns . Yes, replication services will process the identity columns and highlight key allocations separately, but this is a headache that you don't want to deal with. This in itself is a great argument for using Guid for your primary key.
Do not let your applications perform unnecessary updates. This is actually a bad idea, but this problem is significantly worsened in replication scenarios, both in terms of bandwidth utilization and conflict resolution.

How to configure a new SQL Server database for future replication?

More articles: