Performance consideration: distributing rows across multiple tables vs. concentrating all rows in a single table

Hey.

I need to record information about each step performed in an application in a SQL database. There are certain tables I want the log to be associated with: Product (log when the product was created, etc.), Order (same as above), Delivery (same), and so on.

Data will often need to be retrieved.

I have a few ideas on how to do this:

  • Have one Log table that contains columns for all of these tables, and then, when I want to present the data in the UI for a specific Product, select * from Log where LogId = Product.ProductId. I know it can be messy to have that many columns, but I have a feeling the performance will be better. On the other hand, this table will have a huge number of rows (a rough sketch of what I mean follows this list).
  • Have a separate log table for each type of log (ProductLogs, OrderLogs, etc.). I really don't like this idea, since it isn't consistent and means many tables with the same structure, which makes no sense - but (?) it might be faster, since you are searching a table with fewer rows (or am I wrong?).
  • As in option 1, but with a second mapping table that has LogId, TableNameId and RowId columns and relates one log row to many rows in the database, plus a UDF to retrieve the data (for example, log id 234 belongs to the Customer table at CustomerId 345 and to the Product table where ProductId = RowId). I think this is the best way to do it, but then again there may be a huge number of rows - will that slow the lookup down? Or is this how it should be done? What do you say?
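
To make option 1 a bit more concrete, here is a rough sketch of the kind of wide table I mean (this is just one possible reading of it; the table name and the per-entity columns are placeholders):

-- Sketch of option 1: one wide log table with a nullable column per logged entity.
CREATE TABLE dbo.WideLog (
    LogId       int IDENTITY(1,1) NOT NULL,
    UserId      int NULL,
    Description varchar(1024) NOT NULL,
    ProductId   int NULL,    -- filled in only when the entry relates to a Product
    OrderId     int NULL,    -- filled in only when the entry relates to an Order
    DeliveryId  int NULL,    -- filled in only when the entry relates to a Delivery
    CONSTRAINT PK_WideLog PRIMARY KEY CLUSTERED (LogId)
);
GO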

Example for option 3 from the list above:

-- Log holds one row per log entry.
CREATE TABLE [dbo].[Log](
    [LogId] [int] IDENTITY(1,1) NOT NULL,
    [UserId] [int] NULL,
    [Description] [varchar](1024) NOT NULL,
    CONSTRAINT [PK_Log] PRIMARY KEY CLUSTERED
    (
        [LogId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Log] WITH CHECK
    ADD CONSTRAINT [FK_Log_Table] FOREIGN KEY ([UserId])
    REFERENCES [dbo].[Table] ([TableId])
GO

ALTER TABLE [dbo].[Log] CHECK CONSTRAINT [FK_Log_Table]
GO

---------------------------------------------------------------------

-- LogReference links one log entry to one or more rows in other tables.
CREATE TABLE [dbo].[LogReference](
    [LogId] [int] NOT NULL,
    [TableName] [varchar](32) NOT NULL,
    [RowId] [int] NOT NULL,
    CONSTRAINT [PK_LogReference] PRIMARY KEY CLUSTERED
    (
        [LogId] ASC,
        [TableName] ASC,
        [RowId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

SET ANSI_PADDING OFF
GO

ALTER TABLE [dbo].[LogReference] WITH CHECK
    ADD CONSTRAINT [FK_LogReference_Log] FOREIGN KEY ([LogId])
    REFERENCES [dbo].[Log] ([LogId])
GO

ALTER TABLE [dbo].[LogReference] CHECK CONSTRAINT [FK_LogReference_Log]
GO

---------------------------------------------------------------------

-- GetLog returns the log entries attached to a given table/row combination.
CREATE FUNCTION GetLog
(
    @TableName varchar(32),
    @RowId int
)
RETURNS @Log TABLE
(
    LogId int NOT NULL,
    UserId int NOT NULL,
    Description varchar(1024) NOT NULL
)
AS
BEGIN
    INSERT INTO @Log
    SELECT [Log].LogId, [Log].UserId, [Log].Description
    FROM [Log]
    INNER JOIN LogReference ON [Log].LogId = LogReference.LogId
    WHERE (LogReference.TableName = @TableName) AND (LogReference.RowId = @RowId)

    RETURN
END
GO
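
Calling the UDF for a specific row would then look something like this (345 is just a placeholder id, and I am assuming the function lives in the dbo schema):

-- Fetch all log entries recorded against Product 345.
SELECT LogId, UserId, Description
FROM dbo.GetLog('Product', 345);
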
+4
4 answers

I would choose option 3 for several reasons:

  • The data belongs in table fields, not encoded in a table name (option 2) or a field name (option 1). That makes the database easier to work with and easier to maintain (see the sketch after these points).
  • Narrow tables generally perform better. The number of rows has less impact on performance than the number of fields.
  • If you have a field for each table (option 1), you will likely end up with many empty fields whenever an operation only touches some of the tables.
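
For example, with option 3 a question like "show everything logged against the Product table" stays an ordinary query over data rather than a query per table (a sketch using the schema from the question):

-- All log entries that reference a row in the Product table.
SELECT L.LogId, L.UserId, L.Description, LR.RowId
FROM dbo.[Log] AS L
INNER JOIN dbo.LogReference AS LR
    ON LR.LogId = L.LogId
WHERE LR.TableName = 'Product';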

+1

Be careful about optimizing the database prematurely. Most databases are reasonably fast and fairly sophisticated, so you want to run a performance test first.

Second, putting everything in just one table makes it more likely that the results you want are in the cache, which will speed things up considerably. Unfortunately, it also makes it much more likely that you will have to search through a giant table to find what you are looking for. This can be partly addressed with an index, but indexes are not free (they make writes more expensive, for one thing).
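
For instance, if you go with option 3, an index to support the lookups by table and row might look like this (just a sketch based on the schema in the question; whether it pays off depends on your write volume):

-- Supports WHERE TableName = ... AND RowId = ... lookups on LogReference.
-- LogId comes along automatically because it is part of the clustering key,
-- so the join back to [Log] needs no extra included columns.
CREATE NONCLUSTERED INDEX IX_LogReference_TableName_RowId
    ON dbo.LogReference (TableName, RowId);
GO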

My advice would be to run a test to see whether performance really matters here, and then test the different scenarios to see which is fastest.

+3

If you are talking about large volumes of data (millions of rows and up), then you will get a benefit from splitting them across different tables.

E.g. take a basic example of 50 million log entries, assuming 5 different "types" of log. You are better off with 5 tables of 10 million rows each than 1 table of 50 million rows (a rough sketch of one such per-type table follows the points below):

  • INSERT performance will be better with separate tables - the indexes on each table will be smaller, and therefore faster / cheaper to update and maintain as part of the insert operation

  • READ performance will be better with separate tables - there is less data to query and smaller indexes to traverse. Also, with a single table it looks like you would need an extra column to identify what type of log entry each row is (product, delivery, ...)

  • MAINTENANCE on smaller tables is less painful (statistics, index defragmentation / rebuilds, etc.)
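
To make the separate-table approach concrete, one of the per-type tables might look roughly like this (the name comes from the question, but the columns are assumptions):

CREATE TABLE dbo.ProductLogs (
    LogId       int IDENTITY(1,1) NOT NULL,
    ProductId   int NOT NULL,            -- the product this entry relates to
    UserId      int NULL,
    Description varchar(1024) NOT NULL,
    CONSTRAINT PK_ProductLogs PRIMARY KEY CLUSTERED (LogId)
);
GO
-- OrderLogs, DeliveryLogs, ... would follow the same pattern.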

Essentially, this is about partitioning data. SQL Server 2005 onwards has built-in partitioning support (see here), but you need Enterprise Edition for it. It basically lets you split the data within a single table for better performance (e.g. you would have one log table and then define how the data is partitioned within it).
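
As a rough illustration of what built-in partitioning looks like (a sketch only - the LogTypeId column and the boundary values are assumptions, and Enterprise Edition is required):

-- Partition a single log table by an assumed LogTypeId column
-- (e.g. 1 = Product, 2 = Order, 3 = Delivery, ...).
CREATE PARTITION FUNCTION pfLogType (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3, 4);
GO
CREATE PARTITION SCHEME psLogType
    AS PARTITION pfLogType ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.PartitionedLog (
    LogId       int IDENTITY(1,1) NOT NULL,
    LogTypeId   int NOT NULL,   -- decides which partition the row lands in
    UserId      int NULL,
    Description varchar(1024) NOT NULL,
    CONSTRAINT PK_PartitionedLog PRIMARY KEY CLUSTERED (LogTypeId, LogId)
) ON psLogType (LogTypeId);
GO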

I recently listened to an interview with one of the eBay architects, who stressed the importance of partitioning when performance and scalability are needed, and I strongly agree based on my own experience.

+2

Try to implement your data access layer in such a way that you can move from one database model to another if necessary - that way you just pick one option now and worry about the performance implications later.
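
In SQL terms, one way to keep that flexibility is to put a thin procedure in front of the log storage so callers never depend on the physical model (a sketch - the procedure name is made up, and the body currently assumes the option-3 schema from the question):

CREATE PROCEDURE dbo.usp_GetLogForRow
    @TableName varchar(32),
    @RowId     int
AS
BEGIN
    SET NOCOUNT ON;
    -- Backed by the option-3 schema today; only this body changes if the model does.
    SELECT L.LogId, L.UserId, L.Description
    FROM dbo.[Log] AS L
    INNER JOIN dbo.LogReference AS LR
        ON LR.LogId = L.LogId
    WHERE LR.TableName = @TableName
      AND LR.RowId = @RowId;
END
GO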

Without doing some performance testing, and without an accurate idea of the kinds of load you are going to get, it will be hard to optimize, because performance depends on a number of factors, such as the number of reads, the number of writes, and whether the reads and writes are likely to conflict and cause locking.

My preference would be for option 1, btw - it is the easiest to do, and there are a number of tweaks you can make to help with the various kinds of problems you might run into.

0
