SQL MAX() function with WHERE clause and GROUP BY does not use index efficiently

I have a table, MYTABLE, with approximately 25 columns, two of which are USERID (integer) and USERDATETIME (datetime).

I have an index on this table on these two columns, with USERID as the first column, followed by USERDATETIME.

I would like to get the maximum USERDATETIME for each USERID. So:

    SELECT USERID, MAX(USERDATETIME)
    FROM MYTABLE
    WHERE USERDATETIME < '2015-10-11'
    GROUP BY USERID

I would expect the optimizer to be able to find each unique USERID and its maximum USERDATETIME with a number of seeks equal to the number of unique USERIDs, and I would expect that to be reasonably fast. I have 2,000 users and 6 million rows in MYTABLE. However, the actual plan shows 6 million rows coming from an index scan. If I use an index on USERDATETIME / USERID instead, the plan changes to an index seek, but it still reads 6 million rows.

Why doesn't SQL use the index in such a way as to reduce the number of rows processed?

2 answers

If you are using SQL Server, this is not an optimization the product generally performs (except in limited cases where the table is partitioned by that value).

However, you can do it manually with a recursive CTE, for example:

    CREATE TABLE YourTable
    (
        USERID       INT,
        USERDATETIME DATETIME,
        OtherColumns CHAR(10)
    );

    CREATE CLUSTERED INDEX IX ON YourTable(USERID ASC, USERDATETIME ASC);

    WITH R
         AS (SELECT TOP 1 USERID,
                          USERDATETIME
             FROM   YourTable
             ORDER  BY USERID DESC,
                       USERDATETIME DESC
             UNION ALL
             SELECT SubQuery.USERID,
                    SubQuery.USERDATETIME
             FROM   (SELECT T.USERID,
                            T.USERDATETIME,
                            rn = ROW_NUMBER() OVER (ORDER BY T.USERID DESC, T.USERDATETIME DESC)
                     FROM   R
                            JOIN YourTable T
                              ON T.USERID < R.USERID) AS SubQuery
             WHERE  SubQuery.rn = 1)
    SELECT *
    FROM   R

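The recursive query above does not apply the date filter from the question, and SQL Server's default recursion limit of 100 would be exceeded with roughly 2,000 users. A minimal sketch of one way to handle both, assuming the YourTable definition above; the @Cutoff variable is introduced here only for illustration, and the anchor is wrapped in a derived table so that TOP and ORDER BY sit unambiguously inside their own query:

```sql
-- Hypothetical variable; the date is the one from the question,
-- written in the language-neutral yyyymmdd form.
DECLARE @Cutoff DATETIME = '20151011';

WITH R AS
(
    -- Anchor: the highest USERID that has a row before the cutoff,
    -- together with its latest qualifying USERDATETIME.
    SELECT A.USERID, A.USERDATETIME
    FROM (SELECT TOP (1) USERID, USERDATETIME
          FROM YourTable
          WHERE USERDATETIME < @Cutoff
          ORDER BY USERID DESC, USERDATETIME DESC) AS A

    UNION ALL

    -- Recursive step: the next lower USERID with a qualifying row.
    SELECT SubQuery.USERID, SubQuery.USERDATETIME
    FROM (SELECT T.USERID,
                 T.USERDATETIME,
                 rn = ROW_NUMBER() OVER (ORDER BY T.USERID DESC, T.USERDATETIME DESC)
          FROM R
          JOIN YourTable T
            ON T.USERID < R.USERID
           AND T.USERDATETIME < @Cutoff) AS SubQuery
    WHERE SubQuery.rn = 1
)
SELECT USERID, USERDATETIME
FROM R
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```

Users with no rows before the cutoff are skipped, which matches what the original GROUP BY query returns. It is worth checking the actual plan on your own data before relying on this.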

If you have another table containing the user IDs, you can easily get an efficient plan with:

    SELECT U.USERID,
           CA.USERDATETIME
    FROM   Users U
           CROSS APPLY (SELECT TOP 1 USERDATETIME
                        FROM   YourTable Y
                        WHERE  Y.USERID = U.USERID
                        ORDER  BY USERDATETIME DESC) CA

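The CROSS APPLY query also leaves out the date filter from the question. A sketch of how it might be added, assuming the same Users and YourTable tables; @Cutoff is a hypothetical variable holding the question's date:

```sql
-- Hypothetical variable; same date as in the question, yyyymmdd form.
DECLARE @Cutoff DATETIME = '20151011';

SELECT U.USERID,
       CA.USERDATETIME
FROM   Users U
       CROSS APPLY (SELECT TOP (1) Y.USERDATETIME
                    FROM   YourTable Y
                    WHERE  Y.USERID = U.USERID
                           AND Y.USERDATETIME < @Cutoff  -- filter from the question
                    ORDER  BY Y.USERDATETIME DESC) CA;
```

With an index on (USERID, USERDATETIME), each user should need only one seek to the last row before the cutoff. CROSS APPLY drops users with no qualifying row, as the GROUP BY version does.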


The WHERE clause is what limits how well your query can use the index.

In a typical SQL Server query, an index is used either to locate records quickly (which this index can do) or to restrict the records returned (which this index does not allow). So why can't this index be used to narrow down the rows quickly?

When the query optimizer considers optimizations based on the WHERE clause, it looks for an index that either starts with the column(s) in the WHERE clause, or one that can be used to efficiently identify which records are allowed (or not allowed) in the result set.

Using this index, the server could first find the distinct user IDs. It would then need to restrict the rows examined according to the WHERE clause. To do that, however, the optimizer probably estimates that it would have to perform the equivalent of a full index or table scan after finding the user IDs.

An alternative strategy is to scan the index and identify user IDs and dates together. This is what the optimizer chose.

One possible solution is a different index, on the date first and then the user ID, in addition to the one already in use. This would limit the number of records scanned to find the maximum date for each user ID, and might therefore be slightly faster.
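As a sketch, such an index on the question's table could look like this (the index name is made up for illustration):

```sql
-- Hypothetical index name; MYTABLE and its columns come from the question.
CREATE NONCLUSTERED INDEX IX_MYTABLE_USERDATETIME_USERID
    ON MYTABLE (USERDATETIME ASC, USERID ASC);
```

Note that the asker reports already trying an index in this column order: the plan switched to an index seek but still read 6 million rows, so the gain depends on how many rows the date filter actually excludes.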

Note that your index would be fast if you didn't need a WHERE clause. With the WHERE clause, however, the optimizer has to allow for the case in which the filter affects which rows qualify right up to the last row examined.

Also, an index in which the date column is DESCENDING may be more efficient.
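A sketch of that variant, again with a made-up index name:

```sql
-- Hypothetical index name; the newest USERDATETIME for each USERID is stored first.
CREATE NONCLUSTERED INDEX IX_MYTABLE_USERID_USERDATETIME_DESC
    ON MYTABLE (USERID ASC, USERDATETIME DESC);
```

SQL Server can also read an ascending index backwards, so whether the descending order changes anything for this query is something to confirm from the actual plan.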
