Efficient way to change a 100GB table

We have several databases in which one of the tables stores between 10 and 100 gigabytes of image data. The problem is that many of these databases were not created properly. Essentially, the primary key is not actually a primary key: it was created as a unique index on a nullable column. And some of them have an int as the primary key instead of a bigint.

So we have been slowly working through and fixing these databases. They run on SQL Server 2000 through SQL Server 2008, although most of the ones with primary key problems are on SQL Server 2000. The problem is that we don't want to lock up the database for a whole day while it converts the table. We have looked at several strategies:

  • Tell SQL Server to change the column type directly (see the sketch after this list). This locks the table until it completes, and in many cases leaving it running overnight still wasn't enough to finish.

  • Insert all the images into a new table in one go. This is easier to interrupt, but essentially the entire table gets written to the log file in the process.

  • Insert 100 rows at a time, where the rows do not yet exist in the target table. On the plus side, users can keep using the database while this runs (which has worked well), the process can be stopped and restarted at any time, and it avoids log files 100+ GB in size. This is what we are doing now, but finding the top 100 rows that don't yet exist gets slower and slower as the target table grows. UPDATE STATISTICS and DBCC INDEXDEFRAG help a lot, but on our last attempt we reached the point where even a batch of 100 images just sat there without responding.

    INSERT INTO %s
    SELECT TOP 100 Source.*
    FROM %s AS Source WITH (NOLOCK)
    LEFT OUTER JOIN %s AS Target WITH (NOLOCK)
        ON Source.DocumentID = Target.DocumentID
    WHERE Target.DocumentID IS NULL
    ORDER BY Source.DocumentID
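
For reference, the "direct change" in the first bullet is just an in-place column alteration along these lines (the table name ImageStore is made up for illustration; DocumentID and the int-to-bigint change come from the description above):

    -- Hypothetical table name. Widening int to bigint rewrites every row under a
    -- schema-modification lock, which is why the table stays locked until it finishes.
    -- Any index that covers DocumentID would have to be dropped and recreated around this.
    ALTER TABLE ImageStore
        ALTER COLUMN DocumentID bigint NOT NULL;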

So the question is: is there an option that can bulk-copy the data in an efficient and resumable way? It doesn't have to be 100% accurate; we can always come back and fix any discrepancies at the end, as long as it does 99% of the work.

1 answer

The join is the problem. Don't do that. Just loop through your current table in some reasonable interval, using the existing clustered index. Something like:

    DECLARE @idrange int;
    SET @idrange = 1;

    WHILE @idrange < 10000000
    BEGIN
        -- Copy in fixed DocumentID ranges; BETWEEN is inclusive, so each pass covers 1,000 IDs.
        INSERT INTO Destination
        SELECT *
        FROM Source
        WHERE DocumentID BETWEEN @idrange AND @idrange + 999
        ORDER BY Source.DocumentID;

        SET @idrange = @idrange + 1000;
    END

Note that for best speed, remove all indexes (including the clustered index) from the destination table, then add indexes after all rows have been inserted.
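
If it helps, here is a minimal sketch of that drop-and-recreate step (the constraint name PK_Destination is hypothetical; Destination and DocumentID follow the script above):

    -- Before the load: drop the clustered primary key (and any nonclustered indexes)
    -- so each insert is a plain append rather than an index-maintenance hit.
    ALTER TABLE Destination DROP CONSTRAINT PK_Destination;

    -- ... run the batched copy here ...

    -- After the load: re-add the primary key, which rebuilds the clustered index
    -- in one pass; recreate any other indexes the same way.
    ALTER TABLE Destination
        ADD CONSTRAINT PK_Destination PRIMARY KEY CLUSTERED (DocumentID);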

EDIT: adjusted the range arithmetic so the batches do not overlap (BETWEEN includes both endpoints, so each pass now covers e.g. 1–1000, then 1001–2000).

Final clarification: the whole point of my example script is that you just want to step through your current records in some reasonable order and put them into the new table in batches. There is no reason to keep checking the destination table on every pass, since you should already know what you have put there and what is left. Most of the time it makes sense to use the clustered index (if there is one), since that lets it walk the physical order of the table without doing bookmark lookups. If the table has no clustered index, just use whatever makes the most sense (your PK, probably).
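
If you also want the batches to be restartable without ever re-joining against the destination, one option is to carry a high-water mark and look at the destination only once, at startup. This is just a sketch under the assumption that DocumentID is numeric and unique in both tables; the names Source and Destination and the batch size of 1000 are illustrative, matching the script above:

    DECLARE @lastid int, @batchmax int;

    -- One-time lookup so a restarted job picks up where the previous run stopped.
    SELECT @lastid = ISNULL(MAX(DocumentID), 0) FROM Destination;

    WHILE EXISTS (SELECT * FROM Source WHERE DocumentID > @lastid)
    BEGIN
        -- Find the upper bound of the next batch of 1000 source rows.
        SELECT @batchmax = MAX(DocumentID)
        FROM (SELECT TOP 1000 DocumentID
              FROM Source
              WHERE DocumentID > @lastid
              ORDER BY DocumentID) AS batch;

        INSERT INTO Destination
        SELECT *
        FROM Source
        WHERE DocumentID > @lastid AND DocumentID <= @batchmax;

        SET @lastid = @batchmax;
    END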
