Getting billions of rows from a remote server?

I am trying to get about 200 billion rows from a remote SQL Server. To optimize this, I limited my query to use only an indexed column as a filter, and I select only a subset of columns, so the query looks like this:

SELECT ColA, ColB, ColC FROM <Database> WHERE RecordDate BETWEEN '' AND '' 

But it seems that unless I limit my query to a time window of a few hours, the query fails in every case with the following error:

 OLE DB provider "SQLNCLI10" for linked server "<>" returned message "Query timeout expired".
 Msg 7399, Level 16, State 1, Server M<, Line 1
 The OLE DB provider "SQLNCLI10" for linked server "<>" reported an error. Execution terminated by the provider because a resource limit was reached.
 Msg 7421, Level 16, State 2, Server <>, Line 1
 Cannot fetch the rowset from OLE DB provider "SQLNCLI10" for linked server "<>".

The timeout is probably due to the time the query takes to execute. Since I have no control over the server, I was wondering if there is a good way to retrieve this data beyond the simple SELECT I am using. Are there any SQL Server specific tricks I can use? Perhaps telling the remote server to batch the data rather than issuing multiple queries, or something else? Any suggestions on how I could improve this?

+7
7 answers

This really looks like a job for SSIS. Even a simple data flow such as ReadFromOleDbSource -> WriteToOleDbDestination could handle this, creating the necessary batching for you.

+13

Why read 200 billion rows all at once?

You should page through them, fetching a few thousand rows at a time.

Even if you really do need to read all 200 billion rows, you should still use paging to break the read into shorter queries; that way, if a failure occurs, you just continue reading from where you left off.

See "efficient way to implement paging" for at least one paging method that uses ROW_NUMBER.
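
A minimal sketch of that approach against the question's columns (the table name, the date window values, and the use of ColA as a tie-breaker in the sort are assumptions, since ROW_NUMBER needs a deterministic ORDER BY to page reliably):

    DECLARE @StartDate DATETIME = '20110101';  -- placeholder window
    DECLARE @EndDate   DATETIME = '20110201';
    DECLARE @PageStart BIGINT   = 1;           -- first row of the current batch
    DECLARE @PageSize  BIGINT   = 10000;       -- rows per batch

    WITH Numbered AS (
        SELECT ColA, ColB, ColC,
               ROW_NUMBER() OVER (ORDER BY RecordDate, ColA) AS rn
        FROM   RemoteTable
        WHERE  RecordDate BETWEEN @StartDate AND @EndDate
    )
    SELECT ColA, ColB, ColC
    FROM   Numbered
    WHERE  rn BETWEEN @PageStart AND @PageStart + @PageSize - 1;

Loop, advancing @PageStart by @PageSize, until a batch comes back with fewer than @PageSize rows.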

If you are performing data analysis, then I suspect you are either using the wrong storage (SQL Server is not really designed to handle large analytical data sets), or you need to restructure your queries so that the analysis runs on the server, in SQL.

Update: I think the last paragraph was somewhat misunderstood.

SQL Server's storage is primarily intended for online transaction processing (OLTP): efficiently querying small slices of massive datasets in massively concurrent environments (for example, reading or updating a single customer record in a database of billions, while thousands of other users do the same for other records). Typically, the goal is to minimize the amount of data read, reducing the I/O required and also reducing contention.

The analysis you are talking about is almost the exact opposite of this: a single client actively trying to read all records in order to perform some statistical analysis.

Yes, SQL Server will handle this, but you should keep in mind that it is optimized for a completely different scenario. For example, data is read from disk one page (8 KB) at a time, even though your statistical processing probably needs only 2 or 3 columns. Depending on row density and column widths, you may use only a small fraction of the data stored on each 8 KB page; most of the data SQL Server had to read and allocate memory for is never even used. (Remember that SQL Server also has to lock those pages so that other users do not modify the data while it is being read.)
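
To put rough, purely illustrative numbers on that (none of these figures come from the question): suppose rows average 400 bytes and the three selected columns total 40 bytes per row. An 8 KB page then holds about 20 rows, so

    useful bytes per page: 20 rows x 40 bytes = 800 bytes
    bytes actually read:   8192 bytes (the whole page)

meaning roughly 90% of every page read is overhead as far as the analysis is concerned.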

If you are serious about processing/analyzing massive datasets, then there are storage formats optimized for exactly this kind of work. SQL Server also has an add-on, Microsoft Analysis Services, which provides online analytical processing (OLAP) and data mining capabilities, using storage modes better suited to this kind of processing.

+5

Personally, if I were trying to pull that much data at once, I would use a bulk extraction tool such as BCP to get the data into a local file before trying to manipulate it.

http://msdn.microsoft.com/en-us/library/ms162802.aspx
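
As a rough sketch (the server, database, table names, and the date window are placeholders, not from the question), a bcp export of one time window might look like:

    bcp "SELECT ColA, ColB, ColC FROM MyDb.dbo.MyTable WHERE RecordDate BETWEEN '20110101' AND '20110102'" queryout chunk_20110101.dat -S RemoteServer -T -c

Here queryout exports the result of an ad hoc query, -S names the server, -T uses Windows authentication, and -c writes character (text) format. Exporting one window per file keeps each run short and restartable.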

+3

This is not specific to SQL Server, but even where an RDBMS supports server-side cursors, they are considered bad form. Holding a cursor open means you are consuming resources on the server even while the server is waiting for you to request more data.

Instead, you should reformulate your use of the query so that the server can transmit the entire result set as soon as possible, and then forget about you and your query entirely to make room for the next one. When the result set is too large to process in one go, you should keep track of the last row returned by the current batch, so that you can fetch the next batch starting from that position.
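
A minimal sketch of that pattern in T-SQL (the table name is a placeholder, and it assumes RecordDate orders rows uniquely; if it has ties, add a tie-breaker column to the WHERE and ORDER BY):

    DECLARE @LastDate DATETIME = '19000101';  -- resume point, kept by the client between batches

    SELECT TOP (10000) ColA, ColB, ColC, RecordDate
    FROM   RemoteTable
    WHERE  RecordDate > @LastDate
    ORDER  BY RecordDate;

Each query returns quickly and leaves nothing open on the server; the client records the largest RecordDate it received and passes it back as @LastDate for the next batch.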

+2

Most likely, a "remote query timeout" is set on the remote server. How long does the query run before it fails?

+1

Just ran into the same problem: I also got the message 10:01 (just over ten minutes) after starting the query.

Check this link. The remote query timeout parameter is in the "Connections" section; it is set to 600 seconds by default, and you need to change it to zero (unlimited) or whatever other value you think is right.

+1

Try changing the timeout property of the remote server.

To do this, open SSMS, connect to the server, right-click the server name in Object Explorer, select Properties -> Connections, and change the value in the "Remote query timeout (in seconds, 0 = no timeout)" text box.

[Screenshot: Server Properties -> Connections page showing the Remote query timeout setting]
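
The same setting can also be changed in T-SQL; sp_configure's documented "remote query timeout" option takes seconds, with 0 meaning no timeout:

    EXEC sp_configure 'remote query timeout', 0;
    RECONFIGURE;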

0
