Huge table (9 million records) in SQL Server

Our underwriting company just sent us an extract of data on potential customers. There are 9 million rows. Each row consists of LeadID (GUID), RawLeadXML (xml, probably 3-4 KB at most) and LeadStatusID (int).

At first I tried to add an auto-incrementing integer and make it the primary key of this table. It added the field, but it could not make it the primary key (there is not enough memory in the buffer pool).

What I need to do is take each record, one by one, get the XML, load it into an XmlDocument object in .NET, pull out all the fields I want (first name, last name, etc.), and save the parsed information to another table.

I can't even run a statement like this: select * from Leads where id between 1 and 1000

If I just select 1000 records at a time (select top 1000 * from Leads), that works, but how would I then select the next 1000 records without any reference point?

My machine has 4 processors (2.53 GHz) and 12 GB of RAM. It is not a server, but it is a beefy machine. I honestly don't know what to do next.

EDIT: I neglected to mention that the source was an MDF file (and its associated LDF), so I just attached them in SQL Server.

EDIT 2: I messed up and said the RawLeadXML column was xml - it is not, it is just nvarchar(max). I honestly did not know there was an xml data type.

EDIT 3: I can't even issue a delete statement against this table: "delete from Leads where leadid = '100a7927-5311-4f12-8fe3-95c079d32dd4'" blows up with:

Msg 802, Level 17, State 20, Line 2 There is insufficient memory available in the buffer pool. 

I have no idea what to do next. How the hell is this even a problem? There are thousands of databases in the world with more records than mine.

EDIT 4: In case anyone cares, none of the solutions listed below worked. I am sure this is a limitation of my machine, and definitely not a knock on the excellent answers I received below. Currently, I am transferring the zipped DB (2.6 GB) to our servers at Rackspace, and then I will try to add the index on that hardware, hopefully without taking down our production servers. Once the index is added, I hope I can zip up the db, bring it back to my local machine, and then actually do something with it.

EDIT 5: My machine was literally unable to handle a table of this size. It has 12 GB of RAM, 64-bit Windows 7 Professional, a 2.53 GHz quad-core processor, an SSD drive, etc. That is pretty beefy for a development machine. And it could not handle it.

So I moved the database to our server at Rackspace in London. It has 48 GB of memory, and it was able to add the index I needed. Even after that, my machine here was not able to do anything useful with it, so I wrote a .NET program that runs in London to pull 1000 records at a time, parse them into another table, and then mark the source records as processed.

Once I am done, I will have to leave the database in London, because I doubt I could run any meaningful queries against this monster locally. That will make development fun.

Summary: I believe there is no good way to process a data set of this size without server-class hardware with at least 48 GB of memory (in my case).

+8
windows-7 sql-server-2008
9 answers

9 million rows is not that large; you probably just do not have an index on the LeadId column. Create one first, even though it will take some time (it does not need to be unique or a primary key). Then use "SELECT TOP 1000 LeadId, RawXML FROM Leads ORDER BY LeadId" for the first query. Note the last LeadId value (MaxLeadId), then use "SELECT TOP 1000 LeadId, RawXML FROM Leads WHERE LeadId > MaxLeadId ORDER BY LeadId", and so on...
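To make the pattern concrete, here is a minimal Python sketch of the keyset ("seek") pagination loop this answer describes, simulated over an in-memory list instead of SQL Server. The column name LeadId comes from the question; the batch/watermark logic is the point, not the data access layer (real GUID ordering in SQL Server differs from string ordering, so treat this purely as an illustration of the loop).

```python
def fetch_batch(rows, last_id, batch_size):
    """Simulate: SELECT TOP batch_size ... WHERE LeadId > last_id ORDER BY LeadId."""
    return [r for r in sorted(rows, key=lambda r: r["LeadId"])
            if r["LeadId"] > last_id][:batch_size]

def process_all(rows, batch_size=1000):
    last_id = ""          # sorts lower than any key, so the first batch starts at the top
    processed = []
    while True:
        batch = fetch_batch(rows, last_id, batch_size)
        if not batch:     # no rows past the watermark: we are done
            break
        processed.extend(batch)
        last_id = batch[-1]["LeadId"]   # remember the watermark (the answer's MaxLeadId)
    return processed
```

Each iteration only needs the last key seen, so there is no reference point to lose between batches.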

+4

Error 802 does not mean out of memory in the classic sense of insufficient memory to satisfy an allocation (that would raise error 701). Error 802 actually indicates that the buffer pool cannot grow, which can happen for a couple of reasons:

  • max server memory explicitly prevents buffer pool growth; check the server settings.
  • the virtual address space limit of x86 has been hit and AWE is not enabled. Check whether it is an x86 (32-bit) instance, and if so, verify that all the conditions for enabling AWE are met.

If you still cannot find the problem, read How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005 (the article applies equally to SQL Server 2008 and 2008 R2) and follow the steps there to understand who/what is consuming your memory.

+9

Since adding an auto-id column did not work, adding a column is not an option. But you want to save the changed/cleaned data into a new table anyway. That table can have an auto-id and separate columns for the data you extract from the XML.

Once the data is inserted into that table, you can delete the source rows from the source table. Or use a select statement that excludes rows whose GUID is already in the new table.

Since the rows in the source table have no dependencies or particular order, the order in which you select/process them does not matter.

Based on the comments, here is a proposed SQL statement:

 WHILE EXISTS(SELECT * FROM [source] [s]
              WHERE NOT EXISTS(SELECT * FROM [destination] [d]
                               WHERE [d].[leadId] = [s].[leadId]))
 BEGIN
     INSERT INTO [destination] ([leadId], [RawLeadXML], [LeadStatusId])
     SELECT TOP 100 [s].[leadId], [s].[RawLeadXML], [s].[LeadStatusId]
     FROM [source] [s]
     WHERE NOT EXISTS(SELECT * FROM [destination] [d]
                      WHERE [d].[leadId] = [s].[leadId])
 END

I set the number of records per insert to 100. That should keep you from running out of memory.
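For intuition, here is a small Python model of that WHILE EXISTS loop: repeatedly copy up to 100 source rows whose leadId is not yet in the destination, until the anti-join comes up empty. A dict keyed by leadId stands in for the destination table; this is a sketch of the control flow, not of SQL Server behavior.

```python
def copy_in_batches(source, destination, batch_size=100):
    """Copy rows from `source` (list of dicts) into `destination` (dict keyed
    by leadId) in small batches, skipping rows that were already copied."""
    while True:
        # WHERE NOT EXISTS(... d.leadId = s.leadId): rows not yet copied
        missing = [r for r in source if r["leadId"] not in destination]
        if not missing:
            break                                   # WHILE EXISTS is now false
        for r in missing[:batch_size]:              # INSERT ... SELECT TOP 100
            destination[r["leadId"]] = r
```

Because each pass re-checks what is already in the destination, the loop is safe to interrupt and restart, which matters on a machine that keeps running out of buffer pool.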

+3

Do you have spare disk space? Perhaps create the same table structure in a "TableAux", but with an autonumeric Id, and then insert into it from the original table...

Importing into TableAux from a flat file (export it first if it is not already a flat file) is another way to do this.

Getting an identifier onto your records is the first priority for working with them.

+1

What about

  • Select the first row (TOP 1) of the table and save its LeadID in a variable.
  • Load the xml column value into an XmlDocument object (.NET)
  • pick out the nodes you need using XPath
  • insert those values into a new record
  • delete the record from the "main" table by that leadid (or mark it as "done")
  • commit
  • go to step 1 again
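The steps above can be sketched as a loop. This is a hedged Python simulation using the standard library's xml.etree in place of .NET's XmlDocument/XPath, and plain lists in place of the two SQL tables; the element names firstname/lastname are assumptions about the lead XML schema, not known from the question.

```python
import xml.etree.ElementTree as ET

def process_one(source, destination):
    """One pass of the loop: take the 'TOP 1' row, parse it, store the fields."""
    if not source:
        return False                     # TOP 1 found no row: we are done
    row = source.pop(0)                  # select the first row, then remove it
    doc = ET.fromstring(row["RawLeadXML"])
    destination.append({
        "LeadId": row["LeadId"],
        "FirstName": doc.findtext("firstname"),   # assumed schema
        "LastName": doc.findtext("lastname"),     # assumed schema
    })
    return True

def drain(source, destination):
    while process_one(source, destination):   # "go to step 1 again"
        pass
```

One caveat from the thread itself: doing this strictly one row at a time means 9 million round trips, which is why the accepted approaches batch 100-1000 rows per pass instead.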
+1

Look into doing this with ROW_NUMBER() and ranking functions.

Take a look at this topic. It covers the basic concepts to get you started.
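As a rough illustration of the idea, here is a Python simulation of ROW_NUMBER()-style paging: number the rows once over a stable ordering, then pull any window by its row-number range, the way a `WHERE rn BETWEEN @first AND @last` filter would in T-SQL. This models the concept only; in SQL Server the numbering happens inside the query.

```python
def numbered(rows, key):
    # ROW_NUMBER() OVER (ORDER BY key): a 1-based position for every row
    return [(i + 1, r) for i, r in enumerate(sorted(rows, key=key))]

def page(rows, key, page_no, page_size):
    # Window of row numbers covered by the requested page
    first = (page_no - 1) * page_size + 1
    last = page_no * page_size
    return [r for rn, r in numbered(rows, key) if first <= rn <= last]
```

Note that unlike the keyset approach in the first answer, this re-numbers (re-scans) the whole set for every page, which is exactly why it struggles without an index.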

+1

If it is a flat file, can you take the first 1000 lines, load them, then go back and delete those lines from the file based on the GUID, and repeat? (Back up the file first.)

You could also try loading the file through an Integration Services package, which should be able to keep track of where it is in the import and should be able to run while staying within your memory limits.

0

What format do you get it in initially? If it is csv or some such and you do not need the other information in that xml fragment, I would use SSIS and skip the staging table entirely. The native xml type has some overhead associated with it that you do not need to pay if you just want to shred each xml once and then discard it.

Otherwise, you should be able to use XML path statements directly in a SQL query to move the data from one table to another. If you try to add a new PK from the designer, it can fail badly on a large table. You will need to script the changes and run them manually, possibly tweaking them to make things more efficient. Ultimately, 9 million rows is not that large by modern standards, but it does require more thought than 9 thousand.

0

How about simply extracting the data to a text file directly from the Access database? Then you can import it into the RDBMS of your choice with a bulk insert.

Accessing the Access database through linked tables in MSSQL is not ideal. Here we use a read-only, forward-only cursor to go through the rows one at a time.

This is VBScript, so it will not be the fastest, but it should work. You will need to adapt the XML extraction routine to your schema.

 ' Code run against an Access DB with column RawLeadXML
 Dim connection
 Set connection = CreateObject("ADODB.Connection")
 connection.Open "Driver={Microsoft Access Driver (*.mdb)}; DBQ=d:\path\to\file.mdb;"

 Dim recs
 ' By default this is a read-only, forward-only cursor
 Set recs = connection.Execute("SELECT * FROM Leads")

 Dim fso
 Set fso = CreateObject("Scripting.FileSystemObject")
 Dim out
 Set out = fso.OpenTextFile("d:\path\to\output.txt", 2, True)

 Dim id
 id = 0
 While Not recs.EOF
     id = id + 1
     out.Write CStr(id)   ' write an ID counter
     out.Write ","
     ExtractFieldsFromXML recs.Fields("RawLeadXML").Value, out
     out.Write Chr(10)    ' linefeed
     recs.MoveNext
 Wend

 out.Close
 recs.Close
 connection.Close

 ' Extract data from the XML and write it to the stream,
 ' separated by commas
 Sub ExtractFieldsFromXML(xml, out)
     Dim doc
     Set doc = CreateObject("MSXML2.DOMDocument")
     doc.loadXML xml
     out.Write doc.selectSingleNode("/xml/firstname").text
     out.Write ","
     out.Write doc.selectSingleNode("/xml/lastname").text
 End Sub
0
