SQL Server Odd Results (TSQL) with NEWID () in WHERE Clause

Question

SQL Server Odd Results (TSQL) with NEWID () in WHERE Clause

This is a strange question, but I'm a little confused to explain this behavior:

Background: (no need to know)

So, to get started, I wrote a quick query and inserted a UNIQUERIDENTIFIER list and wanted them to be homogeneous inside the WHERE X IN (...) . In the past, I used an empty UNIQUERIDENTIFIER (all zeros) at the top of the list so that I could insert a single set of UNIQUERIDENTIFIER that looks like this: ,'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX' . This time, in order not to press zeros, I inserted NEWID() , thinking that the chance of a collision was almost impossible, to my surprise, which led to thousands of additional results, for example 50 +% of the table.

Post a question: (part you need to know)

This request:

 -- SETUP: (i boiled this down to the bare minimum) -- just creating a table with 500 PK UNIQUERIDENTIFIERs IF (OBJECT_ID('tempdb..#wtfTable') IS NOT NULL) DROP TABLE #wtfTable; CREATE TABLE #wtfTable (WtfId UNIQUEIDENTIFIER PRIMARY KEY); INSERT INTO #wtfTable SELECT TOP(500) NEWID() FROM master.sys.all_objects o1 (NOLOCK) CROSS JOIN master.sys.all_objects o2 (NOLOCK); -- ACTUAL QUERY: SELECT * FROM #wtfTable WHERE [WtfId] IN ('00000000-0000-0000-0000-000000000000', NEWID());

... should statistically produce boops. But if you run it ten times or so, you'll sometimes get a massive selection. For example, in this last run, I got 465/500 rows, which means that more than 93% of the rows were returned.

Although I understand that NEWID() will be recalculated based on each line, in hell there is no statistical probability that it hits so hard. Everything that I wrote here is required to create a nuanced SELECT ; deleting anything will prevent it. By the way, you can replace IN with WHERE WtfId = '...' OR WtfId = NEWID() and still get the same results. I am using SQL SERVER 2014 Standard, updated to date, without any odd tweaks that I know.

So does anyone know what this is about? Thanks in advance.

Edit:

'00000000-0000-0000-0000-000000000000' is a red herring, here is a version that works with integers: (interestingly, I had to increase the size of the table to 1000 with integers to create a problematic query plan ...)

 IF (OBJECT_ID('tempdb..#wtfTable') IS NOT NULL) DROP TABLE #wtfTable; CREATE TABLE #wtfTable (WtfId INT PRIMARY KEY); INSERT INTO #wtfTable SELECT DISTINCT TOP(1000) CAST(CAST('0x' + LEFT(NEWID(), 8) AS VARBINARY) AS INT) FROM sys.tables o1 (NOLOCK) CROSS JOIN sys.tables o2 (NOLOCK); SELECT * FROM #wtfTable WHERE [WtfId] IN (0, CAST(CAST('0x' + LEFT(NEWID(), 8) AS VARBINARY) AS INT));

or you can simply replace the UNIQUEIDENTIFIER literal and do this:

 DECLARE @someId UNIQUEIDENTIFIER = NEWID(); SELECT * FROM #wtfTable WHERE [WtfId] IN (@someId, NEWID());

both give the same results ... question Why is this happening?

+7

sql sql-server sql-server-2012 sql-server-2008-r2 sql-server-2014

ckozl Jul 21 '16 at 8:08

source share

2 answers

The following query returns nothing as expected. The internal casting type produces unexpected results, I think

 SELECT * FROM wtfTable WHERE convert(varchar(100),WtfId) = '00000000-0000-0000-0000-000000000000' or WtfId = NEWID() ;

+1

Eralper Jul 21 '16 at 8:37

source share

Vladimir Baranov · Accepted Answer · 2016-07-21T10:10:16+0000

Let's take a look at the execution plan.

In this particular query run, Seek returned 51 rows instead of a valued 1.

The following actual query creates a plan with the same form, but it is easier to parse because we have two variables @ID1 and @ID2 that you can track in the plan.

 CREATE TABLE #wtfTable (WtfId UNIQUEIDENTIFIER PRIMARY KEY); INSERT INTO #wtfTable SELECT TOP(500) NEWID() FROM master.sys.all_objects o1 (NOLOCK) CROSS JOIN master.sys.all_objects o2 (NOLOCK); DECLARE @ID1 UNIQUEIDENTIFIER; DECLARE @ID2 UNIQUEIDENTIFIER; SELECT TOP(1) @ID1 = WtfId FROM #wtfTable ORDER BY WtfId; SELECT TOP(1) @ID2 = WtfId FROM #wtfTable ORDER BY WtfId DESC; -- ACTUAL QUERY: SELECT * FROM #wtfTable WHERE WtfId IN (@ID1, @ID2); DROP TABLE #wtfTable;

If you carefully study the operators in this regard, you will see that part of the IN query is converted to a table with two rows and three columns. The Concatenation statement returns this table. Each row in this auxiliary table defines a search range in the index.

 ExpFrom ExpTo ExpFlags @ID1 @ID1 62 @ID2 @ID2 62

Internal ExpFlags indicate which range search is needed ( < , <= , > , >= ). If you add more variables to the IN clause, you will see them in the plan combined with this helper table.

Operators

Sort and Merge Interval must be sure that all possible overlapping ranges are combined. A detailed post about the Merge Interval operator Fabiano Amorim who is considering plans with this form. Here's another good post about this form of Paul White's plan.

At the end, a two-row helper table joins the main table, and for each row in the auxiliary table, there is a search range in the cluster index from ExpFrom to ExpTo , which is shown in Index Seek . The Seek operator shows < and > , but is misleading. The actual comparison is determined by the internal value of the Flags .

If you have several different ranges, for example:

 WHERE ([WtfId] >= @ID1 AND [WtfId] < @ID2) OR [WtfId] = @ID3

you will still see the same plan form with the same search predicate, but with different Flags values.

So there are two options:

 from @ID1 to @ID1, which returns one row from @ID2 to @ID2, which returns one row

In a query with variables, internal expressions result in values from variables if necessary. The value of the variable does not change during query execution, and everything behaves correctly as expected.

How `NEWID()` affects it

When we use NEWID , as in your example:

 SELECT * FROM #wtfTable WHERE WtfId IN ('00000000-0000-0000-0000-000000000000', NEWID());

the plan and all internal processing are the same as for the variables.

The difference is that this internal table effectively becomes:

 ExpFrom ExpTo ExpFlags 0...0 0...0 62 NEWID() NEWID() 62

NEWID() is called twice. Naturally, each call creates a different value, which randomly leads to a range that spans some existing values in the table.

There are two scans of a cluster index range with ranges

 from `0...0` to `0...0` from `some_id_1` to `some_id_2`

Now it’s easy to understand how such a query can return some rows, although the probability of a NEWID collision NEWID very small.

It seems that the optimizer believes that it can call NEWID twice instead of remembering the first random value generated and use it further in the query. There were other cases where the optimizer, called NEWID more than expected, created similar, seemingly impossible results.

For example:

Is it legal for SQL Server to populate PERSISTED columns with data that does not match the definition?

Inconsistent results with computed column NEWID () and PERSISTED

The optimizer should know that NEWID() not deterministic. All in all, this seems like a mistake.

I don't know anything about the internal components of SQL Server, but my wild guess is: there are constant runtime functions like RAND() . NEWID() was put into this category by mistake. Then someone noticed that people do not expect him to return the same ID as RAND() returns the same random number for every call. And they fixed it, actually updating a new identifier every time NEWID() appears in expressions. But the general rules for the optimizer remained the same as for RAND() , therefore, the higher-level optimizer believes that all calls to NEWID() return the same value and freely rebuild expressions using NEWID() , which leads to unexpected results .

Another question arises about the similar strange behavior of NEWID() :

NEWID () Entered virtual table causes unintentional cross-cast behavior

The response says that there is a Connection Error Report and it is closed as "Do not fix". Microsoft's comments essentially say this is design behavior.

The optimizer does not guarantee synchronization or the number of executions of scalar functions. This is a long established principle. This is a fundamental "freedom" that allows the optimizer to get enough freedom, significant improvements in query execution.

SQL Server Odd Results (TSQL) with NEWID () in WHERE Clause

How NEWID() affects it

More articles:

How `NEWID()` affects it