Very large SQL database: what should the schema look like?

I have 2 files that I would like to import into MS SQL. The first file is 2.2 GB and the second is 24 GB. (If you're curious: it's a poker lookup table.)

Importing them into MS SQL is not a problem. Thanks to SqlBulkCopy, I was able to import the first file in just 10 minutes. My problem is that I don’t know what the actual table layout should look like to allow me to make very fast queries. My first naive attempt looks like this:

  CREATE TABLE [dbo].[TblFlopHands] (
     [hand_id] [int] IDENTITY (1,1) NOT NULL,
     [flop_index] [smallint] NULL,
     [hand_index] [smallint] NULL,
     [hs1] [real] NULL,
     [ppot1] [real] NULL,
     [hs2] [real] NULL,
     [ppot2] [real] NULL,
     [hs3] [real] NULL,
     [ppot3] [real] NULL,
     [hs4] [real] NULL,
     [ppot4] [real] NULL,
     [hs5] [real] NULL,
     [ppot5] [real] NULL,
     [hs6] [real] NULL,
     [ppot6] [real] NULL,
     [hs7] [real] NULL,
     [ppot7] [real] NULL,
     [hs8] [real] NULL,
     [ppot8] [real] NULL,
     [hs9] [real] NULL,
     [ppot9] [real] NULL,
  CONSTRAINT [PK_tblFlopHands] PRIMARY KEY CLUSTERED 
 (
     [hand_id] ASC
 ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 ) ON [PRIMARY] 

The flop index is a value from 1 to 22,100 (the first 3 community cards in Texas Hold'em, 52 choose 3). Each flop index has hand_index values from 1 to 1,176 (49 choose 2), so the table contains 22,100 × 1,176 = 25,989,600 rows.

Running the query against my "schema" above took about 25 seconds. After some googling, I found that SQL Server was scanning the table, which is obviously bad. I ran the Database Engine Tuning Advisor and it recommended creating an index on the flop_index column (which makes sense). After creating the index, the disk space required for the database doubled! (Plus the LDF log file grew by 2.6 GB.) But after indexing, the query took only a couple of milliseconds.
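For reference, the advisor's recommendation presumably amounted to something like this (a sketch; the index name and the INCLUDE list are assumptions, but an index that INCLUDEs every returned column is effectively a second copy of the table, which would explain the doubled disk usage):

```sql
CREATE NONCLUSTERED INDEX IX_TblFlopHands_flop_index
    ON dbo.TblFlopHands (flop_index)
    INCLUDE (hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3, hs4, ppot4,
             hs5, ppot5, hs6, ppot6, hs7, ppot7, hs8, ppot8, hs9, ppot9);
```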

Now my question is: how do I do it right? I have never worked with such massive data, the databases that I created before were a joke.

Some notes: after importing the data into MS SQL there will never be any inserts or updates, only SELECTs. So I wonder whether I even need a primary key?

EDIT: here is some additional information to make my question clearer:

1) I will never use hand_id. I just put it there because someone told me long ago that I should always create a primary key for each table.

2) Basically, only one query will ever be used:

  SELECT hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3, hs4, ppot4, hs5, ppot5,
         hs6, ppot6, hs7, ppot7, hs8, ppot8, hs9, ppot9
  FROM TblFlopHands
  WHERE flop_index = 1 ... 22100

This query always returns the 1,176 rows of data I need.

EDIT 2: Just to be more specific: yes, this is static data. I have the data in a binary file, and I wrote a program that can read the data I need from that file in just a few milliseconds. The reason I want the data in a database is that I want to be able to query it from different computers on my network without copying 25 GB to each machine.

HS stands for hand strength; it tells you the current strength of your hole cards in combination with the flop or turn cards. ppot means positive potential: the chance that your hand will be ahead after the next community card is dealt. hs1-hs9 is the strength of the hand against 1-9 opponents, and the same goes for ppot. Calculating ppot on the fly is very CPU-intensive and takes several minutes. I want to build a poker analysis program that gives me a list of all possible hole-card combinations on any flop/turn with their hs/ppot.

+6
sql database schema
5 answers

To answer your question about whether you need a primary key, going only by the information you provided in the question:

Based on your table layout, you may want to keep it. If you drop that identity column, you also drop the clustered index. The clustered index key (4 bytes here) is stored as the row pointer in every nonclustered index row. Dropping the clustered index leaves the table as a heap, and SQL Server then generates an 8-byte row identifier (RID) for each row and uses that as the pointer in nonclustered indexes instead. So in your case, with the schema as given in the question, you could actually INCREASE the size of your nonclustered indexes and end up slowing them down.

With all that said, and depending on the queries you will run and their usage patterns, which were not included in the question, evaluating a clustered index on something other than the identity column may also be in order.
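A sketch of that alternative, assuming the table really is read-only, is always queried by flop_index, and that (flop_index, hand_index) is unique, as the question implies (index name is illustrative):

```sql
-- Drop the surrogate key and cluster on the lookup columns instead, so
-- SELECT ... WHERE flop_index = @n becomes one clustered range seek and
-- no separate nonclustered copy of the data is needed.
ALTER TABLE dbo.TblFlopHands DROP CONSTRAINT PK_tblFlopHands;
ALTER TABLE dbo.TblFlopHands DROP COLUMN hand_id;

CREATE UNIQUE CLUSTERED INDEX IX_TblFlopHands_FlopHand
    ON dbo.TblFlopHands (flop_index, hand_index);
```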

+1

Well, you could split the table into smaller tables if, for example, hs(X) and ppot(X) might grow beyond nine.

This is what you have:

  [hand_id] [int] IDENTITY(1,1) NOT NULL,
  [flop_index] [smallint] NULL,
  [hand_index] [smallint] NULL,
  [hs1] [real] NULL,
  [ppot1] [real] NULL,
  etc...

You could split it into 2 tables (maybe 3 if you need to):

  Table hand: (EXAMPLE)
     [hand_id] [int] IDENTITY(1,1) NOT NULL,
     [flop_index] [smallint] NULL,
     [hand_index] [smallint] NULL

  Table hs_ppot: (EXAMPLE)
     [hand_id] [int] IDENTITY(1,1) NOT NULL,
     [hs] [real] NULL,
     [ppot] [real] NULL

Then you can reference hand_id in each table. A simple example, though.
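In T-SQL the split could look roughly like this (a sketch; the table and column names are examples, and the opponent_count column is an assumption that replaces the hs1..hs9 numbering):

```sql
CREATE TABLE dbo.Hand (
    hand_id    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    flop_index smallint NULL,
    hand_index smallint NULL
);

CREATE TABLE dbo.HsPpot (
    hand_id        int     NOT NULL,  -- references dbo.Hand(hand_id)
    opponent_count tinyint NOT NULL,  -- 1..9, replaces the hs1..hs9 suffix
    hs             real    NULL,
    ppot           real    NULL
);

-- Reassembling one flop's rows:
SELECT h.hand_index, p.opponent_count, p.hs, p.ppot
FROM dbo.Hand AS h
JOIN dbo.HsPpot AS p ON p.hand_id = h.hand_id
WHERE h.flop_index = 1;
```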

By the way, what are hs and ppot?

+1

This is a very common question. Creating indexes potentially reduces query time, but increases the time required for updates and inserts, and also increases the disk space needed per record.

You need to decide, for each column, whether an index offers enough of a performance improvement for your queries to warrant its impact on insert/update performance and disk usage.

As an alternative to indexes, you could use an OLAP cube. If your queries aggregate or apply calculations, you may want to run them overnight and save the results in another table. You can then run simpler queries against the smaller table and get the same result with far less performance impact.
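A sketch of that "run it overnight" idea (the aggregates and table names are purely illustrative):

```sql
-- Materialize a nightly aggregate into a much smaller table...
SELECT flop_index,
       AVG(hs1)   AS avg_hs1,
       AVG(ppot1) AS avg_ppot1
INTO dbo.TblFlopSummary
FROM dbo.TblFlopHands
GROUP BY flop_index;

-- ...and index it so the daytime queries against it are cheap.
CREATE UNIQUE CLUSTERED INDEX IX_TblFlopSummary
    ON dbo.TblFlopSummary (flop_index);
```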

0

It comes down to how you use your indexes and primary keys. If you just want to analyze the data, and you are sure that the only subsequent DML will be SELECT statements (no INSERTs), then dropping the PK should be fine. Note that the hand_id column is an IDENTITY (auto-increment) column, which means SQL Server controls its value anyway (in fact, you cannot insert explicit values into that column without going to the extra trouble of enabling IDENTITY_INSERT before running your INSERT statements, IIRC).

Be mindful of the changing needs of this database, of course. If it does need to change, you should revisit your constraints/indexes/keys.

If future data analysis is a consideration, look into Microsoft SSAS (Analysis Services).

UPDATE: after reading mayo's answer, I agree that indexes (for speed only, not for constraint enforcement) are advisable for the subsequent queries (remember that indexes speed up reads but usually slow down inserts/updates). Since your workload is a single bulk insert followed only by SELECTs, you can do your big insert first and then add the necessary indexes on the columns that are the likely candidates in your queries.
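The load-then-index order could look like this (a sketch, assuming the SqlBulkCopy load from the question; the index name is illustrative):

```sql
-- 1) Create the table with no indexes, so the bulk load goes into a heap.
-- 2) Load the ~26M rows with SqlBulkCopy.
-- 3) Only then build the index(es) the queries need -- building once after
--    the load is much cheaper than maintaining the index during it.
CREATE NONCLUSTERED INDEX IX_TblFlopHands_flop_index
    ON dbo.TblFlopHands (flop_index);
```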

0

Let me preface my answer by saying that storing every possible combination in the database seems wrong to me. I'll get to why in a minute.

I would start with a table called "Cards". It would have one row for each possible card, with fields for suit, face value, and rank, and yes, a CardID as the primary key. Also index the suit and face value columns.
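A sketch of that Cards table (all names and value encodings here are illustrative assumptions):

```sql
CREATE TABLE dbo.Cards (
    CardID    int     NOT NULL PRIMARY KEY,
    Suit      tinyint NOT NULL,  -- e.g. 1 = clubs .. 4 = spades
    FaceValue tinyint NOT NULL,  -- e.g. 2 .. 14 (ace high)
    [Rank]    tinyint NOT NULL
);

CREATE INDEX IX_Cards_Suit      ON dbo.Cards (Suit);
CREATE INDEX IX_Cards_FaceValue ON dbo.Cards (FaceValue);
```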

If you want to lay out all possible Hold'em hands, I would make separate tables for pocket cards (pocketID, pCardID1, pCardID2), flop cards (flopID, fCardID1, fCardID2, fCardID3), and then a table for the turn and river (turnAndRiverID, turnCardID, riverCardID). Then a Hands table with (handID, pocketID, flopID, turnAndRiverID, handScore).

HandScore would be a computed field driven by a table-valued or scalar-valued function.
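One way to wire that up (a sketch; the function body is a placeholder and all names are assumptions):

```sql
CREATE FUNCTION dbo.fnScoreHand (@pocketID int, @flopID int, @turnAndRiverID int)
RETURNS int
AS
BEGIN
    -- A real implementation would look up the card tables and score the hand.
    RETURN 0;
END;
GO

CREATE TABLE dbo.Hands (
    handID         int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    pocketID       int NOT NULL,
    flopID         int NOT NULL,
    turnAndRiverID int NOT NULL,
    -- Computed column backed by the scalar-valued function:
    handScore AS dbo.fnScoreHand(pocketID, flopID, turnAndRiverID)
);
```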

By separating these pieces you avoid a lot of duplication, though you still have to worry about card selection and overlap.

Ideally, I would skip the hand tables entirely and work out the hand and its score in whatever application ends up consuming this data.

Putting too much of your logic in the database can make it hard to adapt when the client asks you to model Omaha or five-card draw, for example.

As for your index question: yes, I would use a primary key, as it lets you quickly refer to a specific hand in your code.

Update

In response to the OP's edit: it sounds like you are using the wrong tool for the job. What is the value of having the data in a database if you always select the same set of records? Look at other options (e.g. a flat XML file, or a static DataSet in your code). That would save you the connection and server overhead for what is essentially static data.

0