Query performance optimization for dynamically joined columns

Current Situation in SQL Server Database

There is an Entry table with the following columns:

  • EntryID (int)
  • EntryName (nvarchar)
  • EntrySize (int)
  • EntryDate (datetime)

It should also be possible to save additional metadata for each entry. The metadata names and values should be freely choosable, and it should be possible to add them dynamically without changing the structure of the database tables. Each metadata key can have one of the following data types:

  • Text
  • Numerical value
  • Datetime
  • Boolean value (True / False)

Thus, there is a DataKey table representing the metadata names and data types, with the following columns:

  • DataKeyID (int)
  • DataKeyName (nvarchar)
  • DataKeyType (smallint): 0 = text; 1 = numeric; 2 = datetime; 3 = bit

In the DataValue table, a value can be inserted for each combination of Entry and DataKey, into the column that matches the metadata key's data type; there is one nullable column per data type. The table has the following columns (a DDL sketch follows the list):

  • DataValueID (int)
  • EntryID (int) Foreign key
  • DataKeyID (int) Foreign key
  • TextValue (nvarchar) Nullable
  • NumericValue (float) Nullable
  • DateValue (datetime) Nullable
  • BoolValue (bit) Nullable
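
For reference, here is a minimal T-SQL sketch of that structure; the keys, constraints and string lengths are assumptions, and the linked SQL Fiddle script below is authoritative:

  -- Sketch of the schema described above (lengths and constraints assumed).
  CREATE TABLE Entry (
      EntryID   int IDENTITY PRIMARY KEY,
      EntryName nvarchar(100) NOT NULL,
      EntrySize int NOT NULL,
      EntryDate datetime NOT NULL
  );

  CREATE TABLE DataKey (
      DataKeyID   int IDENTITY PRIMARY KEY,
      DataKeyName nvarchar(50) NOT NULL,
      DataKeyType smallint NOT NULL  -- 0: text; 1: numeric; 2: datetime; 3: bit
  );

  CREATE TABLE DataValue (
      DataValueID  int IDENTITY PRIMARY KEY,
      EntryID      int NOT NULL REFERENCES Entry(EntryID),
      DataKeyID    int NOT NULL REFERENCES DataKey(DataKeyID),
      TextValue    nvarchar(1000) NULL,
      NumericValue float NULL,
      DateValue    datetime NULL,
      BoolValue    bit NULL
  );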

[Image: database structure diagram showing the Entry, DataKey and DataValue tables.]

Target

The goal is to get the list of entries that satisfy given conditions on the metadata values, expressed as in a WHERE clause, as in the following example:

Assumptions:

  • Metadata key KeyName1 is text
  • Metadata key KeyName2 is datetime
  • Metadata key KeyName3 is numeric
  • Metadata key KeyName4 is Boolean

Query:

  ... WHERE (KeyName1 = 'Test12345' AND KeyName2 BETWEEN '01.09.2012 00:00:00' AND '01.04.2013 23:59:00')
         OR (KeyName3 > 15.3 AND KeyName4 = True)

These queries should run efficiently even against a lot of data, for example:

  • Number of entries > 2,000,000
  • Number of data keys between 50 and 100, possibly > 100
  • Each entry has values for at least a subset of the keys, possibly a value for every key (up to 2,000,000 * 100 values)

PROBLEM

The first problem arises when building the query. Normally a query operates on a set whose columns can be referenced in a WHERE clause. Here, however, those "columns" are rows of the DataKey table, precisely because metadata must be addable dynamically without changing the table structure. Research turned up a solution that builds a PIVOT query at runtime, but it proved very slow once the database holds a large data set.
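
For context, this is roughly what such a runtime PIVOT construction looks like; this is an illustrative sketch against the schema above (the asker's actual query is not shown), pivoting only the text values:

  -- Illustrative sketch: build the pivot column list from the DataKey rows at runtime,
  -- e.g. [KeyName1],[KeyName2],...
  DECLARE @cols nvarchar(max), @sql nvarchar(max);

  SELECT @cols = STUFF((SELECT ',' + QUOTENAME(DataKeyName)
                          FROM DataKey
                         ORDER BY DataKeyID
                           FOR XML PATH('')), 1, 1, '');

  -- Pivot the text values per entry; the other three value columns would need the same treatment.
  SET @sql = N'SELECT p.EntryID, ' + @cols + N'
    FROM (SELECT v.EntryID, k.DataKeyName, v.TextValue
            FROM DataValue v
            JOIN DataKey  k ON k.DataKeyID = v.DataKeyID) s
   PIVOT (MAX(TextValue) FOR DataKeyName IN (' + @cols + N')) p;';

  EXEC sp_executesql @sql;

With 2,000,000 entries and 100 keys, this pivots the entire DataValue table before the WHERE clause can filter anything, which is consistent with the slowness described.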

QUESTIONS

  • Is there a more efficient way or structure to store data for this purpose?
  • How can the requirements listed above be met, particularly with regard to performance and query time?

Here is the SQL script with the described database structure and some sample data: http://www.sqlfiddle.com/#!3/d1912/3

Tags: performance, sql, sql-server, tsql, database-performance

7 Answers

One of the fundamental flaws of the Entity-Attribute-Value design (which is what you have here) is the difficulty of querying it efficiently and effectively.

The more efficient structure for storing data is to abandon EAV and use a properly normalized relational form. But that necessarily involves changing the database structure whenever the data structures change, which your requirements rule out.

You could discard the TextValue/NumericValue/DateValue/BoolValue columns and replace them with a single sql_variant column, which slightly reduces the complexity of the query, but the main problem remains.
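
A sketch of that alternative (my illustration, not code from the answer):

  -- Illustration only: one sql_variant column replaces the four typed columns.
  CREATE TABLE DataValueVariant (
      DataValueID int IDENTITY PRIMARY KEY,
      EntryID     int NOT NULL REFERENCES Entry(EntryID),
      DataKeyID   int NOT NULL REFERENCES DataKey(DataKeyID),
      Value       sql_variant NULL
  );

  -- Predicates then cast back to the expected base type, e.g. for the numeric KeyName3
  -- (the DataKeyID value 3 is assumed):
  SELECT EntryID
    FROM DataValueVariant
   WHERE DataKeyID = 3
     AND CAST(Value AS float) > 15.3;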

As a side note, storing all numbers as float will cause problems if you ever have to deal with money.

---

I don't feel qualified to say which design is better, and I was initially inclined not to answer at all. But you took the time to describe your problem clearly, so I gave it some thought; here is how I would approach it.

I would store each metadata data type in its own table, like so:

  Table MetaData_Text:
      ID       int identity
      EntryID  int
      KeyName  nvarchar(50)
      KeyValue nvarchar(max)

MetaData_DateTime, MetaData_Boolean and MetaData_Numeric have the same structure, each with the corresponding data type for its KeyValue column.

The relationship between Entry and each of these tables is 0-to-many; each row in these tables refers to exactly one entry.
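
A sketch of one of these tables with a supporting index; the answer lists only the columns, so the keys and the index are my assumptions:

  -- Sketch (constraints and index assumed, not specified in the answer).
  CREATE TABLE MetaData_Text (
      ID       int IDENTITY PRIMARY KEY,
      EntryID  int NOT NULL REFERENCES Entry(EntryID),
      KeyName  nvarchar(50) NOT NULL,
      KeyValue nvarchar(max) NULL
  );

  -- The lookup functions below filter on (EntryID, KeyName), so index that pair.
  CREATE INDEX IX_MetaData_Text ON MetaData_Text (EntryID, KeyName);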

To add a new metadata element to an entry, I would simply use a stored procedure that takes an EntryID, a key name, and an optional parameter for each possible metadata data type:

  create procedure AddMetaData
      @entryid   int,
      @keyname   varchar(50),
      @textvalue varchar(max) = null,
      @datevalue datetime = null,
      @boolvalue bit = null,
      @numvalue  float = null
  as
      ...
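
The procedure body is elided above; one hypothetical completion would route whichever parameter was supplied to its per-type table:

  create procedure AddMetaData
      @entryid   int,
      @keyname   varchar(50),
      @textvalue varchar(max) = null,
      @datevalue datetime = null,
      @boolvalue bit = null,
      @numvalue  float = null
  as
  begin
      -- Hypothetical body: insert into the table matching the non-null parameter.
      if @textvalue is not null
          insert MetaData_Text (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @textvalue);
      else if @datevalue is not null
          insert MetaData_DateTime (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @datevalue);
      else if @numvalue is not null
          insert MetaData_Numeric (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @numvalue);
      else if @boolvalue is not null
          insert MetaData_Boolean (EntryID, KeyName, KeyValue) values (@entryid, @keyname, @boolvalue);
  end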

For querying, I would define a set of functions, one for each combination of (a) metadata data type and (b) test to be performed against that data type, for example:

  create function MetaData_HasDate_EQ(@entryid int, @keyname varchar(50), @val datetime)
  returns bit
  as
  begin
      declare @rv bit
      select @rv = case when exists(
                 select 1
                   from MetaData_DateTime
                  where EntryID = @entryid
                    and KeyName = @keyname
                    and KeyValue = @val) then 1 else 0 end;
      return @rv
  end

and reference these functions in the required query logic, along the lines of:

  SELECT ...
    FROM entry e ...
   WHERE (dbo.MetaData_HasText_EQ(e.EntryID, 'KeyName1', 'Test12345') <> 0
          AND dbo.MetaData_HasDate_Btwn(e.EntryID, 'KeyName2', '01.09.2012 00:00:00', '01.04.2013 23:59:00') <> 0)
      OR (dbo.MetaData_HasNum_GT(e.EntryID, 'KeyName3', 15.3) <> 0
          AND dbo.MetaData_HasBool_EQ(e.EntryID, 'KeyName4', 1) <> 0)
---

I believe the performance problems inherent in such a data structure may well require it to be redesigned.

However, I think the rather simple dynamic SQL below lets you query as desired, and it seemed to run quite fast in a quick test with more than 100,000 rows in the Entry table and 500,000 in the DataValue table.

  -- !! CHANGE WHERE CONDITION AS APPROPRIATE
  --declare @where nvarchar(max)='where Key0=0'
  declare @where nvarchar(max)='where Key1<550'
  declare @sql nvarchar(max)='select * from Entry e';

  select @sql = @sql
      + ' outer apply (select ' + DataKeyName + '='
      + case DataKeyType when 0 then 'TextValue'
                         when 1 then 'NumericValue'
                         when 2 then 'DateValue'
                         when 3 then 'BoolValue' end
      + ' from DataValue v where v.EntryID=e.EntryID and v.DataKeyID=' + cast(DataKeyID as varchar)
      + ') ' + DataKeyName + ' '
    from DataKey;

  set @sql += @where;
  exec(@sql);
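
To illustrate (my reconstruction), with two keys, Key1 numeric with DataKeyID 1 and Key2 text with DataKeyID 2, the generated statement would look roughly like:

  -- Reconstruction of the generated SQL (key names and IDs assumed).
  select * from Entry e
   outer apply (select Key1 = NumericValue
                  from DataValue v
                 where v.EntryID = e.EntryID and v.DataKeyID = 1) Key1
   outer apply (select Key2 = TextValue
                  from DataValue v
                 where v.EntryID = e.EntryID and v.DataKeyID = 2) Key2
  where Key1 < 550

Each metadata key becomes one OUTER APPLY, so entries without a value for a key simply get NULL for that column.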
---

You did not provide any information about how often the table is updated, how often new attributes are added, and so on.

Looking at your numbers, I think you could use a snapshot table that flattens your normalized data. It is not ideal, since the columns have to be added to it manually, but it can be very fast. The snapshot can be rebuilt at regular intervals, depending on the needs of your users.
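
A sketch of what building such a snapshot could look like; the table, column and index names are my illustration, flattening only two of the keys:

  -- Illustration (names assumed): rebuild a flattened snapshot of the EAV data.
  IF OBJECT_ID('EntrySnapshot') IS NOT NULL
      DROP TABLE EntrySnapshot;

  SELECT e.EntryID,
         KeyName1 = MAX(CASE WHEN k.DataKeyName = 'KeyName1' THEN v.TextValue    END),
         KeyName3 = MAX(CASE WHEN k.DataKeyName = 'KeyName3' THEN v.NumericValue END)
    INTO EntrySnapshot
    FROM Entry e
    LEFT JOIN DataValue v ON v.EntryID = e.EntryID
    LEFT JOIN DataKey   k ON k.DataKeyID = v.DataKeyID
   GROUP BY e.EntryID;

  -- Index the flattened columns that appear in WHERE clauses.
  CREATE INDEX IX_EntrySnapshot_KeyName3 ON EntrySnapshot (KeyName3);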

---

First, ask why people use EAV or KVP at all when it is so inefficient to query. Blogs and tutorials offer many plausible reasons, but in real life it is usually to avoid having to negotiate every schema change with an unresponsive database administrator.

For a small organization with a small amount of data, a multi-purpose database (OLTP + DW) is fine, since the inefficiency is not noticeable. Once your database gets big, it is time to replicate your online data into a data warehouse. And if the data is intended for analytics, it should be replicated further, from the relational data warehouse into a dimensional model, or flattened wide for consumption.

These are the data models that I would expect from a large organization:

  • OLTP
  • Relational Data Warehouse
  • Dimensional Model for Reporting
  • Data Marts for Analytics

So, to answer your question: you should not query your KVP tables directly, and creating a view on top of them does not improve matters. The data must be flattened (i.e., pivoted) into a physical table. You currently have a hybrid of 1 and 2; if there are no users for 3, just build 4.

---

Based on Dan Belandy's answer, I think the easiest way to use this would be a stored procedure or trigger that looks at the metadata table and creates a view over the data table accordingly.

The code would look like this:

  -- drop old view
  IF object_id('EntryView') IS NOT NULL
      DROP VIEW [EntryView]
  GO

  -- create view based on current meta-information in [DataKey]
  DECLARE @crlf char(2)
  DECLARE @sql nvarchar(max)
  SELECT @crlf = char(13) + char(10)

  SELECT @sql = 'CREATE VIEW [EntryView]' + @crlf
              + 'AS' + @crlf
              + 'SELECT *' + @crlf
              + '  FROM [Entry] e'

  SELECT @sql = @sql + @crlf
              + ' OUTER APPLY (SELECT ' + QuoteName(DataKeyName) + ' = '
              + QuoteName(CASE DataKeyType WHEN 0 THEN 'TextValue'
                                           WHEN 1 THEN 'NumericValue'
                                           WHEN 2 THEN 'DateValue'
                                           WHEN 3 THEN 'BoolValue'
                                           ELSE '<Unknown>' END) + @crlf
              + '  FROM [DataValue] v WHERE v.[EntryID] = e.[EntryID] AND v.[DataKeyID] = '
              + CAST(DataKeyID as varchar)
              + ') AS ' + QuoteName(DataKeyName)
    FROM DataKey

  --PRINT @sql
  EXEC (@sql)

Usage example:

  SELECT *
    FROM EntryView
   WHERE (Key1 = 0 AND Key2 BETWEEN '01.09.2012 00:00:00' AND '01.04.2013 23:59:00')
      OR (Key3 > 'Test15.3' AND Key4 LIKE '%1%')
---

I would use 4 tables, one for each data type:

  MDat1
      DataValueID (int)
      EntryID     (int) Foreign key
      DataKeyID   (int) Foreign key
      TextValue   (nvarchar) Nullable

  MDat2
      DataValueID  (int)
      EntryID      (int) Foreign key
      DataKeyID    (int) Foreign key
      NumericValue (float) Nullable

  MDat3
      DataValueID (int)
      EntryID     (int) Foreign key
      DataKeyID   (int) Foreign key
      DateValue   (datetime) Nullable

  MDat4
      DataValueID (int)
      EntryID     (int) Foreign key
      DataKeyID   (int) Foreign key
      BoolValue   (bit) Nullable

If partitioning were available, I would use it on DataKeyID for all 4 tables. Then I would use 4 views:

  SELECT ... FROM Entry JOIN MDat1 ON ...   -- view EnMDat1
  SELECT ... FROM Entry JOIN MDat2 ON ...   -- view EnMDat2
  SELECT ... FROM Entry JOIN MDat3 ON ...   -- view EnMDat3
  SELECT ... FROM Entry JOIN MDat4 ON ...   -- view EnMDat4

So this example:

  WHERE (KeyName1 = 'Test12345' AND KeyName2 BETWEEN '01.09.2012 00:00:00' AND '01.04.2013 23:59:00')
     OR (KeyName3 > 15.3 AND KeyName4 = True)

Looks like:

  ... EnMDat1 JOIN EnMDat3 ON ...
      AND EnMDat1.TextValue = 'Test12345'
      AND EnMDat3.DateValue BETWEEN '01.09.2012 00:00:00' AND '01.04.2013 23:59:00'
  ...
  UNION ALL
  ... EnMDat2 JOIN EnMDat4 ON ...
      AND EnMDat2.NumericValue > 15.3
      AND EnMDat4.BoolValue = 1

This will run faster than a single metadata table. However, you will need a mechanism to build the queries if you have many different scenarios for the WHERE clauses. You can also omit the views and write the statements from scratch every time.
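
To make this concrete, here is a sketch of how the elided views and the example query might be completed; the DataKeyID values 1 to 4 for KeyName1 to KeyName4 are assumptions:

  -- Assumed completion of one view; the other three follow the same pattern
  -- with NumericValue, DateValue and BoolValue respectively.
  CREATE VIEW EnMDat1 AS
  SELECT e.EntryID, m.DataKeyID, m.TextValue
    FROM Entry e
    JOIN MDat1 m ON m.EntryID = e.EntryID;
  GO

  -- Example query (unseparated date literals are language-neutral in T-SQL).
  SELECT t.EntryID
    FROM EnMDat1 t
    JOIN EnMDat3 d ON d.EntryID = t.EntryID AND d.DataKeyID = 2
   WHERE t.DataKeyID = 1
     AND t.TextValue = 'Test12345'
     AND d.DateValue BETWEEN '20120901 00:00:00' AND '20130401 23:59:00'
  UNION ALL
  SELECT n.EntryID
    FROM EnMDat2 n
    JOIN EnMDat4 b ON b.EntryID = n.EntryID AND b.DataKeyID = 4
   WHERE n.DataKeyID = 3
     AND n.NumericValue > 15.3
     AND b.BoolValue = 1;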
