BIGINT vs. VARCHAR performance considerations

This is the FACT table in our data warehouse.

It has a composite clustered primary key, as follows:

    ALTER TABLE [dbo].[Fact_Data]
    ADD CONSTRAINT [PK_Fact_Data] PRIMARY KEY CLUSTERED
    (
        [Column1_VarChar_10] ASC,
        [Column2_VarChar_10] ASC,
        [Column3_Int] ASC,
        [Column4_Int] ASC,
        [Column5_VarChar_10] ASC,
        [Column6_VarChar_10] ASC,
        [Column7_DateTime] ASC,
        [Column8_DateTime] ASC
    )
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
          IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
          ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

In this structure, all of the VARCHAR(10) columns contain only numeric values. Would it be worthwhile, in terms of querying and indexing, to change this 78-million-row table to store BIGINT instead of VARCHAR?

Any other advantages / disadvantages that I should consider?

+7
performance tsql sql-server-2005
3 answers

You should DEFINITELY introduce a surrogate INT IDENTITY() primary key! INT already gives you potentially up to 2 billion rows - isn't that enough?
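For concreteness, a minimal sketch of what that change could look like (the names are illustrative, and on a 78-million-row table you would want to plan the rebuild carefully):

    -- Sketch: add a surrogate identity column and make it the clustering key.
    ALTER TABLE [dbo].[Fact_Data]
        ADD [Fact_Data_ID] INT IDENTITY(1, 1) NOT NULL;

    -- Drop the old 8-column clustered primary key...
    ALTER TABLE [dbo].[Fact_Data] DROP CONSTRAINT [PK_Fact_Data];

    -- ...cluster on the narrow surrogate instead...
    ALTER TABLE [dbo].[Fact_Data]
        ADD CONSTRAINT [PK_Fact_Data] PRIMARY KEY CLUSTERED ([Fact_Data_ID] ASC);

    -- ...and keep the natural key unique via a non-clustered constraint.
    ALTER TABLE [dbo].[Fact_Data]
        ADD CONSTRAINT [UQ_Fact_Data_Natural] UNIQUE NONCLUSTERED
        ([Column1_VarChar_10], [Column2_VarChar_10], [Column3_Int], [Column4_Int],
         [Column5_VarChar_10], [Column6_VarChar_10], [Column7_DateTime], [Column8_DateTime]);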

This primary/clustering key on SQL Server can be up to 64 bytes wide (instead of 4 bytes for an INT), which will bloat your clustered index and every one of your non-clustered indexes beyond recognition. The entire clustering key (all 8 of your columns) is carried in every single entry of every non-clustered index on this table - a great deal of wasted space, surely.

With a surrogate INT clustering key, you can fit up to 16 times more entries on any given index page - which means less I/O and less time spent reading index pages.

And imagine trying to establish a foreign-key relationship to this table... any child table would have to carry all 8 columns of your primary key as foreign-key columns, and specify all 8 columns in every join - what a nightmare!
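To make the nightmare concrete, here is a hypothetical child table (Fact_Detail is invented purely for illustration) referencing this table both ways:

    -- Hypothetical child table: with the composite key, the foreign key must
    -- repeat all 8 columns of the parent's primary key.
    CREATE TABLE [dbo].[Fact_Detail] (
        [Detail_ID]          INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        [Column1_VarChar_10] VARCHAR(10) NOT NULL,
        [Column2_VarChar_10] VARCHAR(10) NOT NULL,
        [Column3_Int]        INT NOT NULL,
        [Column4_Int]        INT NOT NULL,
        [Column5_VarChar_10] VARCHAR(10) NOT NULL,
        [Column6_VarChar_10] VARCHAR(10) NOT NULL,
        [Column7_DateTime]   DATETIME NOT NULL,
        [Column8_DateTime]   DATETIME NOT NULL,
        CONSTRAINT [FK_Fact_Detail_Fact_Data] FOREIGN KEY
            ([Column1_VarChar_10], [Column2_VarChar_10], [Column3_Int], [Column4_Int],
             [Column5_VarChar_10], [Column6_VarChar_10], [Column7_DateTime], [Column8_DateTime])
        REFERENCES [dbo].[Fact_Data]
            ([Column1_VarChar_10], [Column2_VarChar_10], [Column3_Int], [Column4_Int],
             [Column5_VarChar_10], [Column6_VarChar_10], [Column7_DateTime], [Column8_DateTime])
    );

    -- With a surrogate key, the same relationship collapses to a single column:
    --     [Fact_Data_ID] INT NOT NULL
    --         REFERENCES [dbo].[Fact_Data] ([Fact_Data_ID])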

At 78 million rows, even just changing the clustering key to INT IDENTITY will save you up to 60 bytes per row - that amounts to up to 4 GB of disk space (and of RAM usage on your server). And that doesn't even begin to count the savings in the non-clustered indexes...
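Checking that arithmetic: the 16x figure comes from the key width (64 bytes / 4 bytes = 16), and 60 bytes saved per row x 78,000,000 rows is about 4.7 billion bytes - roughly the 4 GB quoted - in the clustered index alone, before counting the copies of the clustering key carried in every non-clustered index.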

And yes, of course, I would also change the VARCHAR(10) columns to INT or BIGINT - if it's a number, make the field a numeric type; don't leave it as VARCHAR(10), really. But that by itself won't make much difference in speed or performance - it simply makes the data easier to work with (no need to keep converting to numeric types when comparing values, etc.).
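Mechanically, that type change could look like this sketch (the columns sit inside PK_Fact_Data, so the constraint must be dropped first - or already replaced by the surrogate key above - and every stored value is assumed to convert cleanly):

    -- Sketch: drop the key, retype the four VARCHAR(10) columns, re-create the key.
    ALTER TABLE [dbo].[Fact_Data] DROP CONSTRAINT [PK_Fact_Data];

    ALTER TABLE [dbo].[Fact_Data] ALTER COLUMN [Column1_VarChar_10] BIGINT NOT NULL;
    ALTER TABLE [dbo].[Fact_Data] ALTER COLUMN [Column2_VarChar_10] BIGINT NOT NULL;
    ALTER TABLE [dbo].[Fact_Data] ALTER COLUMN [Column5_VarChar_10] BIGINT NOT NULL;
    ALTER TABLE [dbo].[Fact_Data] ALTER COLUMN [Column6_VarChar_10] BIGINT NOT NULL;
    -- ...then re-create the primary key (ideally on the surrogate column).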

Marc

+14

Two things that can affect index (and overall database) performance:

1) Index page size
2) Comparison speed

On the first point: in general, the smaller your index and data entries, the more of them fit per page, so the more pages you can hold in memory and the greater the likelihood that a query will find its page in cache rather than on slow disk. Thus you would want to use the smallest data type that can comfortably accommodate your existing and anticipated future needs.
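One way to observe this, assuming SQL Server 2005 or later, is to compare page counts and record sizes before and after a type change using sys.dm_db_index_physical_stats:

    -- Page count and average record size for every index on the fact table.
    SELECT index_id, index_level, page_count, avg_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
        DB_ID(), OBJECT_ID('dbo.Fact_Data'), NULL, NULL, 'DETAILED');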

BIGINT is 8 bytes; a VARCHAR can be smaller if the stored values are short, so it really depends on your data. That said, numbers up to 10 digits long may fit SQL Server's INT data type ( http://msdn.microsoft.com/en-us/library/ms187745.aspx ) depending on their magnitude, so INT vs. BIGINT depends on your domain.
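Since the question says these columns hold only numeric values, a quick check like this sketch (one column shown) tells you whether INT's ceiling of 2,147,483,647 is enough:

    -- If the maximum fits below 2,147,483,647, INT suffices; otherwise use BIGINT.
    SELECT MAX(CAST([Column1_VarChar_10] AS BIGINT)) AS [MaxValue]
    FROM [dbo].[Fact_Data];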

In addition, if your entire row is fixed-length, there are certain optimizations SQL Server can make when scanning, because it knows exactly where the next row will be on disk (assuming the rows are contiguous). An edge case, certainly, but it can help.

On the second point: it is faster to compare integers than character strings. So if you are storing only numeric data, you should definitely switch to a numeric data type of the appropriate size.

Finally, Marc is right that this is a very convoluted primary key. However, if your data warrants it - for example, if these are your ONLY columns and you never run additional queries - you may be perfectly fine with an optimized (BIGINT-ed, etc.) version of the compound key. Be that as it may, it does smell like a code smell, so I'll second his advice to really take a look at your data model and check whether it is correct.

+4

Marc S is right that the 64-byte primary key will be duplicated in every NC index, so you will pay an I/O cost, which will reduce the amount of data held in memory (since you are wasting space on the NC index pages). So on that basis, the question is not "should I convert my varchars", but "should I consider converting my clustered index into something completely different?"

In terms of VARCHAR vs. BIGINT, there is a good reason to convert if you can afford the time: beyond the 2-byte difference in storage per field, when you compare values of two different types, SQL is forced to convert one of them. This happens on every single comparison, whether for an index join or a predicate inside a WHERE clause.
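A small illustration of that cost (a sketch; the literal value is made up). Because INT outranks VARCHAR in data type precedence, SQL Server converts the column side, row by row:

    -- VARCHAR column vs. INT literal: the column is implicitly converted to INT
    -- for every row examined, which also defeats an index seek on that column.
    SELECT COUNT(*)
    FROM [dbo].[Fact_Data]
    WHERE [Column1_VarChar_10] = 12345;

    -- Matching the types (or converting the column for good) avoids the conversion:
    SELECT COUNT(*)
    FROM [dbo].[Fact_Data]
    WHERE [Column1_VarChar_10] = '12345';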

Depending on the data types you pick for the dimension tables that join to this fact table, you could be picking up that conversion overhead on every single query, because the join has to convert one side of it.
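For example (Dim_Product, Product_ID, and ProductName are hypothetical names here), a join like this must convert the VARCHAR side on every row:

    -- Hypothetical dimension join: INT key in the dimension, VARCHAR(10) in the fact.
    SELECT d.[ProductName], COUNT(*) AS [FactRows]
    FROM [dbo].[Fact_Data] AS f
    JOIN [dbo].[Dim_Product] AS d
        ON f.[Column1_VarChar_10] = d.[Product_ID]  -- implicit conversion on each row
    GROUP BY d.[ProductName];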

+1
