Best way to get distinct values from a large table

Question

I have a database table with about 10 columns, two of which are month and year. The table currently has about 250 thousand rows, and we expect it to grow by about 100-150 thousand records a month. Many queries filter on the month and year columns (for example, all records from March 2010), so we frequently need the available month and year combinations (i.e. do we have records for April 2010?).

A coworker thinks we should have a separate table from our main one that contains only the months and years for which we have data. We only add records to our main table once a month, so it would just be a small update at the end of our scripts to add a new row to this second table. The second table would be queried whenever we need to find the available month/year combinations in the first table. This solution seems silly to me and a violation of DRY.

What do you think is the correct way to solve this problem? Is there a better way than having two tables?

+6

performance sql-server

derivation Apr 21 '10 at 18:16

5 answers

A simple index on the required columns (year and month) should greatly speed up either a DISTINCT or a GROUP BY query.
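As a rough sketch of that index, assuming the table is called dbo.YourBigTable with integer columns [Year] and [Month] (the table, column, and index names here are placeholders, since the question does not give the real ones):

```sql
-- Hypothetical names: dbo.YourBigTable, [Year], [Month], IX_YourBigTable_Year_Month.
CREATE NONCLUSTERED INDEX IX_YourBigTable_Year_Month
    ON dbo.YourBigTable ([Year], [Month]);
GO

-- List the available year/month combinations; the narrow index can be
-- scanned instead of the full table.
SELECT DISTINCT [Year], [Month]
FROM dbo.YourBigTable
ORDER BY [Year], [Month];

-- "Do we have records for April 2010?" can be answered with an index seek.
SELECT CASE WHEN EXISTS (
           SELECT 1
           FROM dbo.YourBigTable
           WHERE [Year] = 2010 AND [Month] = 4
       ) THEN 1 ELSE 0 END AS HasData;
```

The DISTINCT query still reads every index entry, so its cost grows with the table; the EXISTS check does not.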

I would not go with a secondary table, as it adds the overhead of maintaining it (every insert, update, or delete on the main table would require checking the secondary table).

EDIT:

You might also want to consider an indexed view; see "Improving Performance with SQL Server 2005 Indexed Views".

+12

Adriaan stander Apr 21 '10 at 18:21

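Create an indexed (materialized) view over:

SELECT DISTINCT MonthCol, YearCol FROM YourTable

You then get the precomputed distinct values without repeating the work on every query. Note that SQL Server does not accept DISTINCT in an indexed view definition, so in practice the view needs GROUP BY with COUNT_BIG(*) instead.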

+1

KM. Apr 21 '10 at 18:30

Make the date the first column of the table's clustered index key. This is very typical for historical data, because most, if not all, queries are interested in specific ranges, and a clustered index on time serves them directly. All queries such as "the month of May" should be expressed as ranges, for example: WHERE DATECOLKEY BETWEEN '05/01/2010' AND '06/01/2010'. A question like "are there any records in May" is then answered with a simple seek into the clustered index.

Although this may seem complicated to a programmer, it is the optimal approach to this database design problem.
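A minimal sketch of this layout, assuming a hypothetical datetime column named DATECOLKEY on dbo.YourBigTable and that the table does not already have a clustered index (for example from a clustered primary key). The index name is made up for illustration. The range is written as a half-open interval because BETWEEN is inclusive on both ends and would also match rows stamped exactly at midnight on June 1:

```sql
-- Hypothetical names: dbo.YourBigTable, DATECOLKEY, CIX_YourBigTable_DateColKey.
CREATE CLUSTERED INDEX CIX_YourBigTable_DateColKey
    ON dbo.YourBigTable (DATECOLKEY);
GO

-- All records for May 2010 (start inclusive, end exclusive).
SELECT *
FROM dbo.YourBigTable
WHERE DATECOLKEY >= '20100501'
  AND DATECOLKEY <  '20100601';

-- "Are there any records in May 2010?"
IF EXISTS (SELECT 1
           FROM dbo.YourBigTable
           WHERE DATECOLKEY >= '20100501'
             AND DATECOLKEY <  '20100601')
    PRINT 'May 2010 has data';
```

The unseparated 'YYYYMMDD' literals are used because they are interpreted the same way regardless of the session's language and date format settings.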

+1

Remus Rusanu Apr 21 '10 at 19:01

Use a materialized view, which SQL Server calls an indexed view (a view created WITH SCHEMABINDING that has an index on it). When you do this, SQL Server essentially creates and maintains the extra table your colleague suggested (the data is stored in the index) and keeps it consistent with the base table for you. Here's how to do it:

Create a view that returns the distinct [year], [month] values, and then create an index on [year], [month] on the view. SQL Server will use the tiny index on the view instead of scanning the big table. Because SQL Server will not let you index a view that uses the DISTINCT keyword, use GROUP BY [year], [month] instead and include COUNT_BIG(*) in the SELECT. It will look something like this:

    CREATE VIEW dbo.vwMonthYear
    WITH SCHEMABINDING
    AS
    SELECT [year], [month], COUNT_BIG(*) AS [MonthCount]
    FROM [dbo].[YourBigTable]
    GROUP BY [year], [month]
    GO

    CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
        ON [dbo].[vwMonthYear] ([Year], [Month])

Now when you SELECT DISTINCT [Year], [Month] from the big table, the query optimizer scans the tiny index on the view instead of scanning millions of rows in the big table.

 SELECT DISTINCT [year], [month] FROM YourBigTable

This technique took me from 5 million reads with an estimated I/O of 10.9 to 36 reads with an estimated I/O of 0.003. The overhead is the maintenance of an additional index: every time the big table is updated, the index on the view is updated as well.

If you find that this index significantly slows down your load times, drop the index, perform the data load, and then recreate it.
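A sketch of that drop-and-recreate pattern around a load, using the view and index names from this answer. The load step itself is left as a placeholder, since it depends on your scripts:

```sql
-- Drop the view's clustered index before the monthly load; the view
-- definition itself stays in place.
DROP INDEX ICU_vwMonthYear_Year_Month ON dbo.vwMonthYear;
GO

-- ... run the monthly load into dbo.YourBigTable here ...

-- Rebuild the index once the load has finished.
CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
    ON dbo.vwMonthYear ([Year], [Month]);
GO
```

Between the drop and the recreate, queries for the distinct values fall back to reading the base table.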

Full working example:

    CREATE TABLE YourBigTable(
        YourBigTableID INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_YourBigTable_YourBigTableID PRIMARY KEY,
        [Year] INT,
        [Month] INT)
    GO

    CREATE VIEW dbo.vwMonthYear
    WITH SCHEMABINDING
    AS
    SELECT [year], [month], COUNT_BIG(*) AS [MonthCount]
    FROM [dbo].[YourBigTable]
    GROUP BY [year], [month]
    GO

    CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month
        ON [dbo].[vwMonthYear] ([Year], [Month])

    SELECT DISTINCT [year], [month] FROM YourBigTable
    -- The actual execution plan shows SQL Server scanning ICU_vwMonthYear_Year_Month
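One caveat: according to the SQL Server documentation, the optimizer only matches a query on the base table to an indexed view automatically in Enterprise edition. On other editions the view can be queried directly with the NOEXPAND table hint. A sketch using the names from the example above:

```sql
-- Read the precomputed year/month combinations straight from the view's index.
SELECT [Year], [Month]
FROM dbo.vwMonthYear WITH (NOEXPAND)
ORDER BY [Year], [Month];

-- "Do we have records for April 2010?"
IF EXISTS (SELECT 1
           FROM dbo.vwMonthYear WITH (NOEXPAND)
           WHERE [Year] = 2010 AND [Month] = 4)
    PRINT 'April 2010 has data';
```

Querying the view by name also makes the intent explicit, at the cost of tying the query to the view.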

+1

David Sopko Feb 09 '12 at 16:42

Gabriel Guimarães · Accepted Answer · 2010-04-21T21:36:28+0000

Make sure there is a clustered index on those columns, and partition the table on these date columns, placing the data files on different disk drives. I believe keeping your index fragmentation low is your best shot.

I also believe that a materialized view with the desired select is not a good idea, because it adds insert/update overhead. You are adding an average of 3.5 rows per minute, or about one insert every 17 seconds (on average; please correct me if I am wrong).

The question is: are you selecting more often than every 17 seconds? That is the key point. Hope this helps.
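The figures above can be reproduced from the upper growth estimate in the question, assuming 150,000 rows per month, a 30-day month, and inserts spread evenly over time (the last two are simplifying assumptions, not something the question states):

```sql
SELECT 150000.0 / (30 * 24 * 60)      AS InsertsPerMinute,       -- about 3.5
       (30 * 24 * 60 * 60) / 150000.0 AS SecondsBetweenInserts;  -- about 17.3
```

Keep in mind that the question describes a single load per month, so this is an average rate and not the actual insert pattern.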
