I am developing an application, one aspect of which is that it has to receive huge amounts of data into a SQL Server database. I designed the schema as a single table with a bigint identity key, something like this:
CREATE TABLE MainTable ( _id bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, field1, field2, ... )
I will omit how I intend to query this data, as it is not relevant to the question I have.
I wrote a prototype that inserts data into this table using SqlBulkCopy. It seemed to work fine in the lab: I was able to insert tens of millions of records at a rate of ~3K records/sec (each record is fairly large, ~4 KB). Since the only index on this table is the auto-incrementing bigint, I did not see a slowdown even after a significant number of rows had been pushed.
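For context, the prototype's insert path looks roughly like the following. This is a minimal sketch, not the actual code: the connection string is a placeholder, and the DataTable is assumed to be built elsewhere (for example by the MakeDataTable helper mentioned in the update below).

```csharp
using System.Data;
using System.Data.SqlClient;

// Minimal sketch of the prototype's insert path; names and connection string are placeholders.
static void BulkInsert(DataTable data, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bcp = new SqlBulkCopy(connection))
        {
            bcp.DestinationTableName = "MainTable";
            bcp.BulkCopyTimeout = 0;      // no timeout for large batches
            bcp.WriteToServer(data);      // pushes all rows from the DataTable
        }
    }
}
```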
Given that the lab SQL Server was a relatively weak virtual machine (4 GB of RAM shared with other VMs on the host), I expected to get significantly better throughput on a physical machine, but that did not happen, or rather the performance increase was negligible: inserts were maybe 25% faster on the physical machine. Even after I configured a 3-drive RAID0, which benchmarked 3x faster than a single drive, I got no improvement. In short: a faster storage subsystem, a dedicated physical CPU, and double the RAM translated into practically no performance gain.
I then repeated the test on the largest available Azure instance (8 cores, 16 GB) and got the same result. So adding more cores did not change the insertion speed.
By this point I had played with the following software options without any significant performance boost (a combined sketch of these settings follows the list):
- Changing SqlBulkCopy.BatchSize
- Inserting from multiple threads concurrently and tuning the number of threads
- Using the table lock option (SqlBulkCopyOptions.TableLock) on SqlBulkCopy
- Eliminating network latency by inserting from a local process over the shared memory protocol
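The question does not show how these settings were combined, so the following is only a hedged sketch: the batch size, server name, and database name are placeholders, and the `lpc:` prefix is one way to force the shared memory protocol when the client runs on the SQL Server machine itself.

```csharp
using System.Data;
using System.Data.SqlClient;

// Illustrative only: connection string, batch size, and table name are placeholders.
static void BulkInsertWithOptions(DataTable dataTable)
{
    // "lpc:" forces the shared memory protocol for a client running on the server box.
    var connectionString =
        "Data Source=lpc:MYSQLHOST;Initial Catalog=MyDb;Integrated Security=SSPI;";

    // This method would be called concurrently from several threads.
    using (var bcp = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
    {
        bcp.DestinationTableName = "MainTable";
        bcp.BatchSize = 5000;        // the knob referred to above; the value is a placeholder
        bcp.BulkCopyTimeout = 0;
        bcp.WriteToServer(dataTable);
    }
}
```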
I am trying to increase performance by at least 2-3x, and my original idea was that throwing more hardware at the problem would do the trick, but so far it has not.
So can someone recommend:
- Which resource should I suspect as the bottleneck, and how can I confirm it?
- Is there a methodology I could try to get a reliably scalable improvement in bulk insert throughput, given a single SQL Server system?
UPDATE: I am certain that the application feeding the data is not the problem. It builds the records into a queue ahead of time on a separate thread, so when an insert happens it looks like this (simplified):
    // ===> start logging time
    int batchCount = (queue.Count - 1) / targetBatchSize + 1;
    Enumerable.Range(0, batchCount).AsParallel()
        .WithDegreeOfParallelism(MAX_DEGREE_OF_PARALLELISM)
        .ForAll(i =>
        {
            var batch = queue.Skip(i * targetBatchSize).Take(targetBatchSize);
            var data = MYRECORDTYPE.MakeDataTable(batch);
            var bcp = GetBulkCopy();
            bcp.WriteToServer(data);
        });
    // ===> end logging time
The timings are logged, and the part that builds the queue never takes a significant fraction of the total.
UPDATE 2: I instrumented how long each operation inside this loop takes, and the breakdown looks like this:
- queue.Skip().Take() - negligible
- MakeDataTable(batch) - 10%
- GetBulkCopy() - minor
- WriteToServer(data) - 90%
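For reference, the breakdown above was collected with simple stopwatch timing around each call; a minimal sketch of that kind of instrumentation (the helper is mine, not the original code) could look like this:

```csharp
using System;
using System.Diagnostics;

// Illustrative timing helper (not the original instrumentation) for breaking down the loop body.
static T Timed<T>(string label, Func<T> work)
{
    var sw = Stopwatch.StartNew();
    T result = work();
    Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
    return result;
}

// Example usage inside the loop body shown in the first update:
//   var data = Timed("MakeDataTable", () => MYRECORDTYPE.MakeDataTable(batch));
//   Timed("WriteToServer", () => { GetBulkCopy().WriteToServer(data); return true; });
```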
UPDATE 3: I am developing against the Standard edition of SQL Server, so I cannot rely on table partitioning, which is only available in the Enterprise edition. But I tried a variant of a partitioning-like scheme:
- Created 16 filegroups (G0 through G15),
- Created 16 insert-only tables (T0 through T15), each placed on its own filegroup. The tables have no indexes at all, not even a clustered identity column.
- The threads that insert data cycle through all 16 tables, which all but guarantees that each bulk insert operation uses its own table.
This gave a ~20% improvement in bulk insert throughput. CPU cores, the LAN interface, and drive I/O were not maxed out; they ran at about 25% of maximum capacity.
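For reference, here is a minimal sketch of the round-robin table selection described above, assuming the heap tables T0..T15 already exist; the counter and method names are mine, not the original code.

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Threading;

// Sketch of round-robin assignment of bulk inserts to T0..T15 (helper names are illustrative).
static int _tableCounter = -1;

static void BulkInsertRoundRobin(DataTable data, string connectionString)
{
    // Each call picks the next table, so concurrent bulk inserts rarely share a target table.
    int tableIndex = Interlocked.Increment(ref _tableCounter) % 16;

    using (var bcp = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
    {
        bcp.DestinationTableName = "T" + tableIndex;
        bcp.BulkCopyTimeout = 0;
        bcp.WriteToServer(data);
    }
}
```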
UPDATE 4: I think it is now about as good as it gets. I was able to push inserts through at reasonable speeds using the following methods (a sketch of the staging-table flow follows the list):
- Each bulk insert goes into its own staging table, and the results are then merged into the main table.
- The staging tables are recreated fresh for each bulk insert, and table locks are used.
- The IDataReader implementation from here is used instead of a DataTable.
- Bulk inserts are made from multiple client machines.
- Each client accesses SQL Server over its own gigabit VLAN.
- Side processes that read the main table use the NOLOCK hint.
- I examined sys.dm_os_wait_stats and sys.dm_os_latch_stats to rule out contention.
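A hedged sketch of the first two items above (a fresh staging table per bulk insert, merged into the main table afterwards); field1/field2 stand in for the real columns, the staging-table name is a parameter, and error handling and retries are omitted.

```csharp
using System.Data;
using System.Data.SqlClient;

// Sketch of "bulk insert into a fresh staging heap, then merge into MainTable".
static void BulkInsertViaStaging(DataTable data, string connectionString, string stagingName)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // Recreate the staging heap (no indexes) for this batch.
        var create = $"IF OBJECT_ID('{stagingName}') IS NOT NULL DROP TABLE {stagingName}; " +
                     $"SELECT TOP 0 field1, field2 INTO {stagingName} FROM MainTable;";
        using (var cmd = new SqlCommand(create, connection))
            cmd.ExecuteNonQuery();

        // Bulk copy into the staging table with a table lock.
        using (var bcp = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
        {
            bcp.DestinationTableName = stagingName;
            bcp.WriteToServer(data);
        }

        // Merge the staged rows into the main table in one set-based statement.
        var merge = $"INSERT INTO MainTable WITH (TABLOCK) (field1, field2) " +
                    $"SELECT field1, field2 FROM {stagingName};";
        using (var cmd = new SqlCommand(merge, connection))
            cmd.ExecuteNonQuery();
    }
}
```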
At this point I had a hard time deciding who should get credit for the accepted answer. To those of you whose answer was not accepted: I apologize, it was a very difficult call, and I thank you all.
UPDATE 5: The following item could use some further optimization:
- The IDataReader implementation from here is used instead of a DataTable.
Unless you run your program on a machine with a massive number of CPU cores, that implementation could use some refactoring. Because it uses reflection to generate the get/set accessors, that becomes the main CPU load. If performance is key, you gain a lot by hand-coding the IDataReader so the accessors are compiled rather than generated through reflection (a sketch follows).
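To illustrate the idea, here is a minimal hand-coded IDataReader over a hypothetical MyRecord type (two placeholder fields matching field1/field2 above). GetValue, GetOrdinal, and GetFieldType are plain compiled switches, so no reflection runs per row. This is a sketch of the technique, not the linked implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical record type standing in for MYRECORDTYPE; field names follow the placeholders above.
public class MyRecord
{
    public string Field1;
    public int Field2;
}

// Hand-coded IDataReader: the hot-path members used by SqlBulkCopy do real work without reflection;
// the remaining interface members are trivial casts or stubs.
public sealed class MyRecordReader : IDataReader
{
    private readonly IEnumerator<MyRecord> _rows;
    private bool _closed;

    public MyRecordReader(IEnumerable<MyRecord> rows) { _rows = rows.GetEnumerator(); }

    public int FieldCount => 2;
    public bool Read() => _rows.MoveNext();

    public object GetValue(int i)
    {
        switch (i)                       // compiled accessors: no per-row reflection
        {
            case 0: return _rows.Current.Field1;
            case 1: return _rows.Current.Field2;
            default: throw new IndexOutOfRangeException();
        }
    }

    public string GetName(int i) => i == 0 ? "field1" : "field2";
    public int GetOrdinal(string name)
    {
        if (string.Equals(name, "field1", StringComparison.OrdinalIgnoreCase)) return 0;
        if (string.Equals(name, "field2", StringComparison.OrdinalIgnoreCase)) return 1;
        throw new IndexOutOfRangeException(name);
    }

    public bool IsDBNull(int i) => GetValue(i) == null;
    public int GetValues(object[] values)
    {
        int n = Math.Min(values.Length, FieldCount);
        for (int i = 0; i < n; i++) values[i] = GetValue(i);
        return n;
    }

    // Housekeeping members of IDataReader.
    public int Depth => 0;
    public bool IsClosed => _closed;
    public int RecordsAffected => -1;
    public bool NextResult() => false;
    public void Close() { _closed = true; }
    public void Dispose() { Close(); _rows.Dispose(); }
    public DataTable GetSchemaTable() => null;

    // Typed getters simply cast the untyped value.
    public bool GetBoolean(int i) => (bool)GetValue(i);
    public byte GetByte(int i) => (byte)GetValue(i);
    public char GetChar(int i) => (char)GetValue(i);
    public DateTime GetDateTime(int i) => (DateTime)GetValue(i);
    public decimal GetDecimal(int i) => (decimal)GetValue(i);
    public double GetDouble(int i) => (double)GetValue(i);
    public float GetFloat(int i) => (float)GetValue(i);
    public Guid GetGuid(int i) => (Guid)GetValue(i);
    public short GetInt16(int i) => (short)GetValue(i);
    public int GetInt32(int i) => (int)GetValue(i);
    public long GetInt64(int i) => (long)GetValue(i);
    public string GetString(int i) => (string)GetValue(i);
    public string GetDataTypeName(int i) => GetFieldType(i).Name;
    public Type GetFieldType(int i) => i == 0 ? typeof(string) : typeof(int);
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public long GetBytes(int i, long o, byte[] b, int off, int len) => throw new NotSupportedException();
    public long GetChars(int i, long o, char[] b, int off, int len) => throw new NotSupportedException();
    public IDataReader GetData(int i) => throw new NotSupportedException();
}
```

With something like this, the call becomes bcp.WriteToServer(new MyRecordReader(records)), which also avoids materializing a DataTable per batch.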