Any good literature on join performance versus systematic denormalization?

As a corollary to this question, I was wondering whether there are good comparative studies I could consult on the advantages of letting the RDBMS do the work: optimizing joins versus systematically denormalizing so that every access only ever touches one table at a time.

In particular, I want to get information about:

  • Performance of normalization compared to denormalization.
  • Scalability of normalized versus denormalized systems.
  • Maintainability issues with denormalized data.
  • Model consistency problems introduced by denormalization.

A bit of history so you can see where I am coming from: our system has its own database abstraction layer, but it is very old and can only work with one table at a time. As a result, every complex object has to be assembled through multiple queries, one against each of the related tables. To guarantee that the system always works from a single table, systematic denormalization is applied throughout, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have worked around those by carefully shaping the data model to avoid them, always falling back to 1-n or n-1 instead.
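To make the pattern concrete, here is a minimal sketch of the kind of flattening I mean; the table and column names are invented for illustration, not taken from our actual schema:

    -- The shape the data would naturally take:
    CREATE TABLE customer   (customer_id INT PRIMARY KEY, customer_name VARCHAR(100));
    CREATE TABLE orders     (order_id INT PRIMARY KEY, customer_id INT REFERENCES customer (customer_id), order_date DATE);
    CREATE TABLE order_line (line_id INT PRIMARY KEY, order_id INT REFERENCES orders (order_id), product_code VARCHAR(20), qty INT);

    -- What the one-table-at-a-time layer pushes us towards: columns copied
    -- down from one and two levels up, so a single SELECT answers the question.
    CREATE TABLE order_line_flat (
        line_id       INT PRIMARY KEY,
        order_id      INT,
        order_date    DATE,           -- copied from orders
        customer_id   INT,            -- copied from orders
        customer_name VARCHAR(100),   -- copied from customer, two levels up
        product_code  VARCHAR(20),
        qty           INT
    );

    -- Single-table access, no join required:
    SELECT customer_name, order_date, product_code, qty
    FROM order_line_flat
    WHERE customer_id = 42;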

The end result is a huge, complex system whose performance the customers regularly complain about. When those bottlenecks are analysed, nobody questions the basic premises the system is built on; they always look for some other fix.

Did I miss something? I think the whole idea is wrong, but I lack conclusive evidence to prove (or refute) it. That is where I turn to your collective wisdom: point me to good, well-accepted literature that can convince the rest of my team that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).

My next step would be to build my own benchmark and collect the results, but I hate reinventing the wheel, so I want to know what already exists on this subject.

---- EDIT Note: the system was first built with flat files, without any database; it was only moved to a database later because the client insisted on Oracle. They did not refactor, they simply bolted relational database support onto the existing system. Flat file support was eventually dropped, but we are still waiting for the refactorings that would actually take advantage of the database.

sql premature-optimization legacy-code denormalization
2 answers

A thought: you have a clear impedance mismatch here; a data access layer that only lets you touch one table at a time? Stop right there: that is simply incompatible with using a relational database well. Relational databases are designed to execute complex queries efficiently. Having no option other than returning a single table, and presumably doing any joining in the business layer, just does not make sense.
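As a rough illustration of the mismatch (reusing the hypothetical customer/orders/order_line tables sketched in the question above), this is the difference between one statement the optimizer can plan as a whole and the hand-rolled substitute a one-table layer forces on you:

    -- One statement: the database chooses join order, indexes and access paths.
    SELECT c.customer_name, o.order_date, l.product_code, l.qty
    FROM customer c
    JOIN orders o     ON o.customer_id = c.customer_id
    JOIN order_line l ON l.order_id    = o.order_id
    WHERE c.customer_id = 42;

    -- One table at a time: three round trips, with the join re-implemented
    -- by hand in the business layer.
    SELECT * FROM customer   WHERE customer_id = 42;
    SELECT * FROM orders     WHERE customer_id = 42;
    SELECT * FROM order_line WHERE order_id IN (1001, 1002);  -- ids gathered in application code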

For the justification of normalization, and the consistency costs you take on when you abandon it, you can refer to all the material from Codd onwards, starting with his original papers.

I predict that benchmarking this sort of thing will be a never-ending activity; there will always be special cases. I maintain that normalization is the "normal" case: people get good enough performance from a cleanly designed database. Perhaps the right approach is a poll: "How normalized is your data? Scale of 0 to 4."

----

As far as I know, Dimensional Modeling is the only systematic denormalization technique with some theory behind it. It is the foundation of data warehousing methods.

DM was introduced by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has since written many books; the one that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I have not read it yet.
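For a flavour of what dimensional modeling looks like in practice, here is a minimal, invented star schema: a narrow fact table of measures surrounded by wide, deliberately denormalized dimension tables (all names are illustrative):

    -- Dimensions: hierarchies are flattened into one wide row per member.
    CREATE TABLE dim_date (
        date_key      INT PRIMARY KEY,
        full_date     DATE,
        month_name    VARCHAR(10),
        quarter       INT,
        calendar_year INT
    );

    CREATE TABLE dim_product (
        product_key  INT PRIMARY KEY,
        product_name VARCHAR(100),
        category     VARCHAR(50),
        department   VARCHAR(50)   -- category/department hierarchy flattened
    );

    -- Fact table: foreign keys to the dimensions plus additive measures.
    CREATE TABLE fact_sales (
        date_key    INT REFERENCES dim_date (date_key),
        product_key INT REFERENCES dim_product (product_key),
        qty_sold    INT,
        revenue     DECIMAL(12,2)
    );

    -- Typical analytical query: one join per dimension, grouped measures.
    SELECT d.calendar_year, p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key    = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.calendar_year, p.category;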

There is no question that denormalization improves the performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical e-commerce application), and you need the fastest possible way to query the Products in a given Order, then you can store the data in a denormalized way that supports this and gain some benefit.

But that makes it more awkward and inefficient to query all Orders for a given Product. If you need both kinds of query equally, you should stick with the normalized design. It strikes a compromise, giving both queries comparable performance, though neither will be as fast as it would be in a denormalized design that favors one type of query.
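A small sketch of the trade-off, with invented table names: the denormalized variant makes one direction a single-row read and the other a crude scan, while the junction table gives both directions the same shape and cost.

    -- Denormalized for one direction: each order row carries its product list,
    -- so "which products are in this order?" is a single-row read ...
    CREATE TABLE orders_denorm (
        order_id     INT PRIMARY KEY,
        order_date   DATE,
        product_list VARCHAR(4000)   -- e.g. 'widget, gadget, sprocket'
    );
    SELECT product_list FROM orders_denorm WHERE order_id = 1001;

    -- ... but "which orders contain this product?" degrades into string matching:
    SELECT order_id FROM orders_denorm WHERE product_list LIKE '%widget%';

    -- Normalized: a junction table serves both directions symmetrically.
    CREATE TABLE product (product_id INT PRIMARY KEY, product_name VARCHAR(100));
    CREATE TABLE order_product (
        order_id   INT,
        product_id INT REFERENCES product (product_id),
        PRIMARY KEY (order_id, product_id)
    );

    SELECT p.product_name
    FROM order_product op JOIN product p ON p.product_id = op.product_id
    WHERE op.order_id = 1001;

    SELECT op.order_id
    FROM order_product op JOIN product p ON p.product_id = op.product_id
    WHERE p.product_name = 'widget';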

In addition, when you store data in a denormalized way, you have to do extra work to keep it consistent: no accidental duplication, no broken referential integrity. You have to count the cost of all the manual consistency checks that entails.
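Reusing the hypothetical tables from the sketch above, the consistency cost shows up as soon as a copied value changes: the normalized design needs one statement, the denormalized one needs a hunt for every copy.

    -- Normalized: renaming a product is a single, local change.
    UPDATE product
    SET product_name = 'widget mk2'
    WHERE product_id = 7;

    -- Denormalized: every copy has to be chased down by hand (or by triggers),
    -- and any table you forget silently drifts out of sync.
    UPDATE orders_denorm
    SET product_list = REPLACE(product_list, 'widget', 'widget mk2')
    WHERE product_list LIKE '%widget%';
    -- ... and the same again for every other table carrying the copied value.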

