Are ORDER BY and ROW_NUMBER() deterministic?

I have used SQL in database engines from time to time for several years, but I have little theoretical knowledge, so my question may seem very "noobish" to some of you. It is becoming important to me, though, so I have to ask.

Imagine a Urls table with a non-unique status column. For the sake of the question, assume that the table has a large number of rows and that status has the same value in every row.

And imagine that we run this query many times:

 SELECT * FROM Urls ORDER BY status 
  • Do we get the same row order each time or not? If we do, what happens when we add a few new rows? Will the order change, or will the new rows be appended to the end of the results? And if we do not get the same order, what does the order depend on?

  • Will ROW_NUMBER() OVER (ORDER BY status) return the same order as the query above, or is it based on a different ordering mechanism?

+7
sql sql-server tsql
4 answers

It is very simple. If you need an order you can rely on, then you need to include enough columns in the ORDER BY so that the combination of all these columns is unique for each row. Nothing else is guaranteed.

For a single table, you can usually get what you want by listing the columns that are "interesting" to sort by and then appending the primary key columns. Since the PK by itself guarantees uniqueness, the whole combination is also guaranteed to be unique, so the order is fully determined. For example, if the Urls table has the primary key {Site, Page, Ordinal}, then the following will give you a dependable result:

 SELECT * FROM Urls ORDER BY status, Site, Page, Ordinal 
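The same tie-breaker can be applied to the ROW_NUMBER() part of the question. The following is only a sketch, assuming the same Urls table and the {Site, Page, Ordinal} primary key from the example above; the alias rn is made up for illustration.

```sql
-- Sketch: the window ORDER BY ends with the full primary key, so no two rows
-- tie and every row gets the same number on every run (while the data is unchanged).
SELECT
    ROW_NUMBER() OVER (ORDER BY status, Site, Page, Ordinal) AS rn,
    *
FROM Urls
-- The window ORDER BY only controls the numbering; an outer ORDER BY is
-- still needed to guarantee the order in which the rows are returned.
ORDER BY rn;
```

Without the key columns in the OVER clause, rows that share the same status could be numbered differently from one execution to the next.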
+9

ORDER BY is not stable in SQL Server (nor in any other database, as far as I know). A stable sort is one that returns rows with equal sort keys in the same order in which they are found in the table.

The high-level reason is quite simple. Tables are sets. They have no order. Thus, a "stable" sort simply does not make sense.

The lower-level reasons are probably more important. The database may implement a parallel sorting algorithm. Such algorithms are not stable by default.

If you want a stable sort, include the key columns in the sort.

This is stated in the documentation:

To achieve stable results between query requests using OFFSET and FETCH, the following conditions must be met:

The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated, or all requests for pages from the query are executed in a single transaction using either snapshot or serializable transaction isolation. For more information about these transaction isolation levels, see SET TRANSACTION ISOLATION LEVEL (Transact-SQL).

The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
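As a rough illustration of those two conditions, here is a sketch of paging through the question's Urls table. It assumes the key columns {Site, Page, Ordinal} mentioned in another answer and a made-up page size of 50; treat it as one possible shape, not the only correct one.

```sql
-- Sketch: run all page requests in one serializable transaction so the
-- underlying data cannot change between pages.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

-- Page 1: the ORDER BY ends with the (assumed) key columns, so it is unique.
SELECT *
FROM Urls
ORDER BY status, Site, Page, Ordinal
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;

-- Page 2: same ORDER BY, next 50 rows.
SELECT *
FROM Urls
ORDER BY status, Site, Page, Ordinal
OFFSET 50 ROWS FETCH NEXT 50 ROWS ONLY;

COMMIT TRANSACTION;
```

Snapshot isolation would serve the same purpose, but it has to be enabled on the database first (ALLOW_SNAPSHOT_ISOLATION), which is why the sketch uses SERIALIZABLE.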

+7

I really like these types of questions, because they are a chance to do some performance analysis.

First, let's create a sample database [test] with a table [urls] with a million random entries.

See the code below.

    -- Switch databases
    USE [master];
    go

    -- Create simple database
    CREATE DATABASE [test];
    go

    -- Switch databases
    USE [test];
    go

    -- Create simple table
    CREATE TABLE [urls]
    (
        my_id INT IDENTITY(1, 1) PRIMARY KEY,
        my_link VARCHAR(255),
        my_status VARCHAR(15)
    );
    go

    -- http://stackoverflow.com/questions/1393951/what-is-the-best-way-to-create-and-populate-a-numbers-table

    -- Load table with 1M rows of data
    ;
    WITH
    PASS0 AS ( SELECT 1 AS C UNION ALL SELECT 1 ),           --2 rows
    PASS1 AS ( SELECT 1 AS C FROM PASS0 AS A, PASS0 AS B ),  --4 rows
    PASS2 AS ( SELECT 1 AS C FROM PASS1 AS A, PASS1 AS B ),  --16 rows
    PASS3 AS ( SELECT 1 AS C FROM PASS2 AS A, PASS2 AS B ),  --256 rows
    PASS4 AS ( SELECT 1 AS C FROM PASS3 AS A, PASS3 AS B ),  --65536 rows
    PASS5 AS ( SELECT 1 AS C FROM PASS4 AS A, PASS4 AS B ),  --4,294,967,296 rows
    TALLY AS ( SELECT ROW_NUMBER() OVER ( ORDER BY C ) AS Number FROM PASS5 )
    INSERT INTO urls ( my_link, my_status )
    SELECT
        -- top 10 search engines + me
        CASE ( Number % 11 )
            WHEN 0 THEN 'www.ask.com'
            WHEN 1 THEN 'www.bing.com'
            WHEN 2 THEN 'www.duckduckgo.com'
            WHEN 3 THEN 'www.dogpile.com'
            WHEN 4 THEN 'www.webopedia.com'
            WHEN 5 THEN 'www.clusty.com'
            WHEN 6 THEN 'www.archive.org'
            WHEN 7 THEN 'www.mahalo.com'
            WHEN 8 THEN 'www.google.com'
            WHEN 9 THEN 'www.yahoo.com'
            ELSE 'www.craftydba.com'
        END AS my_link,
        -- ratings scale
        CASE ( Number % 5 )
            WHEN 0 THEN 'poor'
            WHEN 1 THEN 'fair'
            WHEN 2 THEN 'good'
            WHEN 3 THEN 'very good'
            ELSE 'excellent'
        END AS my_status
    FROM TALLY AS T
    WHERE Number <= 1000000
    go

Second, we always want to clear the buffers and the plan cache when doing performance analysis in a test environment. We also want to turn on I/O and time statistics so that we can compare the results.

See the code below.

    -- Show time & i/o
    SET STATISTICS TIME ON
    SET STATISTICS IO ON
    GO

    -- Remove clean buffers & clear plan cache
    CHECKPOINT
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    GO

Third, we want to try the first TSQL statement. Look at the execution plan and write down the statistics.

    -- Try 1
    SELECT * FROM urls ORDER BY my_status

    /*
    Table 'urls'. Scan count 5, logical reads 4987, physical reads 1, read-ahead reads 4918, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 3166 ms, elapsed time = 8130 ms.
    */


Fourth, we want to try the second TSQL statement. Remember to clear the query plan cache and the buffers. If you do not, the query takes less than 1 second, since most of the data is already in memory. Look at the execution plan and write down the statistics.

    -- Try 2
    SELECT ROW_NUMBER() OVER (ORDER BY my_status) as my_rownum, * FROM urls

    /*
    Table 'urls'. Scan count 5, logical reads 4987, physical reads 1, read-ahead reads 4918, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 3276 ms, elapsed time = 8414 ms.
    */


Last but not least, here is the interesting part: the performance analysis.

1 - We see that the second plan is a superset of the first. Both plans scan the clustered index and sort the data. Parallelism is used to combine the results.

2 - The second plan / query has to calculate the row number. It segments the data and computes this scalar. So we end up with two more operators in the plan.

It is not surprising that the first plan executes in 8130 ms and the second in 8414 ms.

Always look at the query plan, both estimated and actual. They tell you what the engine plans to do and what it actually does.

In this example, two different TSQL statements have almost the same plans.
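Neither test statement has a tie-breaker, so rows with the same my_status may come back in a different order from run to run. As a sketch only, assuming the [urls] test table created above, appending the primary key my_id makes both orderings fully determined:

```sql
-- Sketch: my_id is the primary key of the test table, so (my_status, my_id)
-- is unique and the sort order has no ties.
SELECT *
FROM urls
ORDER BY my_status, my_id;

-- The same tie-breaker inside the window makes the numbering repeatable;
-- the outer ORDER BY returns the rows in row-number order.
SELECT ROW_NUMBER() OVER (ORDER BY my_status, my_id) AS my_rownum, *
FROM urls
ORDER BY my_rownum;
```

The timings above were not measured for these variants, so their cost relative to the original statements is not known here.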

Yours faithfully

John

www.craftydba.com

0

The general answer to any SQL question of the form "what order does this come out in?" is "whatever order the server feels like, and it may not be the same from query to query", unless you explicitly request an order.

Even something as simple as "select top 1000 myColumn from myTable" can come back with any rows in any order. For example, the server may use parallel threads, and the first thread to start returning results may have started reading in the middle of the table. Or an index that includes myColumn may have been used, so you got the rows with the alphabetically first productName (this time; last time the statistics were different, so the server chose a different index and gave you the 1000 oldest transactions) ...

It is theoretically possible for the server to say: "I have these 10 pages in the memory cache that match your request, I will give you these while I wait until the disk returns the rest ..."

0
