Type 2 Effective-Dated Indexing with Multiple Dimensions in T-SQL

Question

How do composite indexes work in effective date tables?

Using T-SQL, let's say I have a table that is effective-dated with EffStartDate and EffEndDate, tied to a product, for recording historical price fluctuations. My table takes the form:

MyTable := (EffStartDate date, EffEndDate date, ProductID int, ProductPrice money), where EffEndDate = '12/31/9999' when the record is currently valid.

Suppose I create two indexes on this table:

Clustered on (EffEndDate, EffStartDate, ProductID)
Nonclustered on (EffEndDate, ProductID)

As I understand it, a clustered index stores the data in a B-tree (potentially a B+ tree), ordered by the column list in the index definition. So the table will be sorted by EffEndDate, then EffStartDate, then ProductID. In most cases, I want to query this table historically with something like: SELECT * FROM MyTable WHERE ProductID = @ProductID AND @MyDate BETWEEN EffStartDate AND EffEndDate.

I am trying to picture how the B-tree actually stores the information for these three columns. Does it store each key as a tuple, as you might find in Python, or does it add more dimensions to the B-tree when the index is composite? For example, for a given EffEndDate, does the B-tree hold several subtrees split on EffStartDate, and then several subtrees split on ProductID, or is each split based on the whole tuple? An answer to another question (linked) seems to suggest that it takes the tuple approach.

If it takes the one-dimensional (tuple) approach, it's hard for me to understand how these types of indexes provide any real benefit when looking for a date that falls between two columns. For instance, I picture it working like this: given a date (@MyDate), we use the EffEndDate component of the index to restrict the search to EffEndDate >= @MyDate, then use the EffStartDate component to limit the search to EffStartDate <= @MyDate, and then look for the ProductID within the remaining range. Is that how the index is used?

The problem I foresee is that with about 100 thousand products that are updated unevenly every week, we will end up using this clustered index to produce a giant set of date ranges and then scanning every one of them for our desired ProductID. Is there a better index for this type of query?

The nonclustered index is there to quickly find the current price for a ProductID, because for that we need only two pieces of the puzzle: EffEndDate will be set to '12/31/9999'.

Alternatively, is there a way to implement a two-dimensional (multidimensional) index to improve the performance of this query in T-SQL?

Thanks!

+5

sql-server tsql indexing

Pkmnbugcatcher May 15, '15 at 19:36

3 answers

No LoanID in the table

I assume you mean ProductID

If you are going to search on ProductID = @ProductID, then why in the world would you bury it at the tail of a composite index? Why put the easy part last?

100 thousand updates per week is nothing. You are overthinking this. Just put an index on each column and let the query optimizer do its job.

If you are set on a composite index, then use (ProductID, EffStartDate, EffEndDate).
You can't do better than an index seek!
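
A minimal sketch of that composite index and the query from the question, assuming MyTable exists as described and has no clustered index yet. The index name and the sample values are made up for illustration. With ProductID leading, the equality predicate can be used for a seek, the EffStartDate range narrows it further, and EffEndDate is checked on the rows that remain.

```sql
-- Sketch only: assumes the question's MyTable exists without a clustered index.
-- The index name CIX_MyTable_Product_Dates is hypothetical.
CREATE CLUSTERED INDEX CIX_MyTable_Product_Dates
    ON MyTable (ProductID, EffStartDate, EffEndDate);

-- Sample parameter values, chosen only for illustration.
DECLARE @ProductID int = 1;
DECLARE @MyDate date = '2015-05-15';

-- Same logic as "@MyDate BETWEEN EffStartDate AND EffEndDate",
-- written as two explicit inequalities.
SELECT EffStartDate, EffEndDate, ProductID, ProductPrice
FROM MyTable
WHERE ProductID = @ProductID
  AND EffStartDate <= @MyDate
  AND EffEndDate >= @MyDate;
```

Whether the optimizer actually picks a seek depends on your data and statistics, so check the execution plan.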

+1

paparazzo May 16, '15 at 7:55

Simulate real data. Create a large table (about the size you expect the final table to be) with the distribution of products and dates that you expect in real life. Start by adding three separate, independent indexes on product, start date, and end date. Run the query. Examine the execution plan. Try other index combinations. Compare plans and performance. If nothing gives acceptable performance, come back here with a script that generates sample data and your query.

In my test, the optimizer inner-joined the results of three independent index seeks.

Create table

plus three independent indexes for each column:

    CREATE TABLE [dbo].[Test](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        [ProductID] [int] NOT NULL,
        [StartDate] [date] NOT NULL,
        [EndDate] [date] NOT NULL,
        CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED ( [ID] ASC )
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY];

    CREATE NONCLUSTERED INDEX [IX_EndDate] ON [dbo].[Test] ( [EndDate] ASC )
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];

    CREATE NONCLUSTERED INDEX [IX_ProductID] ON [dbo].[Test] ( [ProductID] ASC )
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];

    CREATE NONCLUSTERED INDEX [IX_StartDate] ON [dbo].[Test] ( [StartDate] ASC )
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];

Generate test data

1M rows in total.
Up to 100 different product IDs with a uniform distribution.
Start dates within 10,000 days of 2000-01-01 (~27 years).
End dates within 1,000 days of the start date (duration up to ~3 years).

Query:

    INSERT INTO Test (ProductID, StartDate, EndDate)
    SELECT TOP(1000000)
        CA.ProductID
        ,DATEADD(day, StartOffset, '2000-01-01') AS StartDate
        ,DATEADD(day, StartOffset + DurationDays, '2000-01-01') AS EndDate
    FROM
        sys.all_objects AS o1
        CROSS JOIN sys.all_objects AS o2
        CROSS APPLY
        (
            SELECT
                CAST((CAST(CRYPT_GEN_RANDOM(4) AS int) / 4294967295.0 + 0.5) * 100 + 1 AS int) AS ProductID
                ,CAST((CAST(CRYPT_GEN_RANDOM(4) AS int) / 4294967295.0 + 0.5) * 10000 AS int) AS StartOffset
                ,CAST((CAST(CRYPT_GEN_RANDOM(4) AS int) / 4294967295.0 + 0.5) * 1000 AS int) AS DurationDays
        ) AS CA;

Query to optimize:

    DECLARE @VarDate date = '2004-01-01';

    SELECT *
    FROM Test
    WHERE ProductID = 1
      AND @VarDate >= StartDate
      AND @VarDate <= EndDate;

It returns ~ 500 rows.
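
One way to compare index combinations, beyond eyeballing the plan, is to look at logical reads and CPU time for the same query under each set of indexes. A minimal sketch using the Test table above; the numbers will depend on your data and hardware, so none are shown here.

```sql
-- Report I/O and timing for each statement in the Messages output.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @VarDate date = '2004-01-01';

SELECT *
FROM dbo.Test
WHERE ProductID = 1
  AND @VarDate >= StartDate
  AND @VarDate <= EndDate;

-- Turn the reporting back off for the rest of the session.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Run it once per index combination and compare the "logical reads" figures reported for Test.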

Execution plan

The server suggested the following index:

    CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
    ON [dbo].[Test] ([ProductID],[StartDate],[EndDate])
    INCLUDE ([ID])

but having such an index is silly, IMHO.

If you have 1M rows and 100K distinct product IDs rather than 100, in other words, if filtering on a specific ProductID eliminates the vast majority of rows, then the best option is probably a single index on ProductID that includes the other columns:

 CREATE NONCLUSTERED INDEX IX_Product ON [dbo].[Test] ([ProductID]) INCLUDE ([StartDate],[EndDate])

OR

 CREATE NONCLUSTERED INDEX IX_Product ON [dbo].[Test] ([ProductID], [StartDate]) INCLUDE ([EndDate])

OR

 CREATE NONCLUSTERED INDEX IX_Product ON [dbo].[Test] ([ProductID],[EndDate]) INCLUDE ([StartDate])

If one of the dates gives good selectivity, then the index should lead with that date instead of ProductID.

If none of the columns has good selectivity, then this is a hard problem.

Edit

It is foolish to blindly create the index the optimizer suggests, because you know that you will search for a specific ProductID, but then for a range of StartDate values and then a range of EndDate values. So the third column, EndDate, will never be used for the seek itself. In that case it is better to INCLUDE the column in the index rather than make it part of the key, as I showed above.

If the query were for a specific ProductID and a specific StartDate (not a range), and then for a range of EndDate values (or a specific EndDate), then having EndDate as part of the key would help.
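
To illustrate that last case, here is a sketch of a query shape where all three key columns of an index on (ProductID, StartDate, EndDate) could take part in the seek: two equality predicates followed by one range. The parameter names and values are made up for illustration.

```sql
-- Hypothetical parameter values, for illustration only.
DECLARE @StartDate date = '2004-01-01';
DECLARE @EndFrom date = '2004-06-01';

-- Equality on ProductID and StartDate, then a range on EndDate:
-- the range column is the last key column used, so it can bound the seek.
SELECT ID, ProductID, StartDate, EndDate
FROM dbo.Test
WHERE ProductID = 1
  AND StartDate = @StartDate
  AND EndDate >= @EndFrom;
```

Once a key column is used for a range, the key columns after it can only filter rows, not narrow the seek. That is why EndDate adds nothing as a key column in the original query.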

+1

Vladimir Baranov May 16 '15 at 11:01

mwigdahl · Accepted Answer · 2015-05-15T21:44:39+0000

This is an application that really calls for a two-dimensional or spatial index, as you rightly noted, since you are effectively combining two separate inequality searches. Short of contorting your tables into a form where you can use SQL Server spatial indexes, your options are limited.
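
To see why a one-dimensional index struggles here, this small sketch sorts a handful of made-up rows the way a key on (EffEndDate, EffStartDate) would order them, and flags the rows that match a date. Only the column names come from the question; the rows and the @d value are hypothetical.

```sql
-- Hypothetical sample rows; a table variable keeps the sketch self-contained.
DECLARE @Sample TABLE (
    ProductID    int  NOT NULL,
    EffStartDate date NOT NULL,
    EffEndDate   date NOT NULL
);

INSERT INTO @Sample (ProductID, EffStartDate, EffEndDate)
VALUES (1, '2015-01-01', '2015-02-28'),
       (2, '2015-03-01', '2015-06-30'),
       (3, '2015-05-01', '2015-08-31'),
       (4, '2015-07-01', '2015-12-31'),
       (5, '2015-02-01', '9999-12-31');

DECLARE @d date = '2015-05-15';

-- Rows come back in composite-key order: whole (EffEndDate, EffStartDate)
-- tuples compared left to right. Matches = 1 marks rows whose range contains @d.
SELECT EffEndDate, EffStartDate, ProductID,
       CASE WHEN EffStartDate <= @d AND EffEndDate >= @d THEN 1 ELSE 0 END AS Matches
FROM @Sample
ORDER BY EffEndDate, EffStartDate;
```

The rows with EffEndDate >= @d form one contiguous block at the end of the ordering, so a seek can find where that block starts. Inside the block, the rows that also satisfy EffStartDate <= @d are interleaved with rows that do not (here the row starting 2015-07-01 sits between matching rows), so the second inequality has to be checked row by row.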

The best approach, if possible, is to find some kind of business relationship between EffStartDate and EffEndDate. If there is a rule that these values can never be more than a year apart, for example, then that is something you can encode in your WHERE clause to gain extra selectivity on indexes that would otherwise need large scans.

Something like:

SELECT * FROM MyTable WHERE @date BETWEEN EffStartDate AND EffEndDate AND DATEADD(year, -1, @date) < EffStartDate

... where the extra predicate, based on a business rule, reduces the search space the query has to go through.
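
Written out with explicit inequalities, the extra predicate turns the open-ended condition on EffStartDate into a closed range that an index leading on EffStartDate could seek on. This is a sketch under a hypothetical rule that no row spans more than a year. That rule would not hold as-is for open-ended rows that use the '12/31/9999' sentinel, so those would need separate handling.

```sql
-- Hypothetical sample date, for illustration only.
DECLARE @date date = '2015-05-15';

-- EffStartDate is now bounded on both sides: (@date - 1 year, @date].
-- Assumes the made-up business rule that every row spans at most one year.
SELECT EffStartDate, EffEndDate, ProductID, ProductPrice
FROM MyTable
WHERE EffStartDate >  DATEADD(year, -1, @date)
  AND EffStartDate <= @date
  AND EffEndDate   >= @date;
```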

Two articles that may interest you are:

Quassnoi's answer to a similar question, which discusses how to coerce this kind of data into a form that can be spatially indexed, and which also links to his blog describing a recursive CTE technique that can speed up these types of queries without changing the schema.

Michael Asher's article on using business knowledge to improve performance for similar types of queries.
