When to Denormalize a Database Design

I know that normalis(z)ation is actively discussed on Stack Overflow, and I have read many of the previous discussions. However, I have some additional questions.

I am working on an old system with at least 100 tables. The database has a poor structure: tables that contain a lot of disparate data, and other problems. I am tasked with improving it. I can't just start over; I need to modify the existing schema.

In the past, I have always tried to design normalized databases. Now to my questions. A senior developer has suggested that in some cases we should not normalize:

1) With temporal data. For example, an invoice is created that refers to a product. If the client requests a copy of this invoice in a year, we must be able to produce an exact copy of the original. What if the product's price, name or description has been updated since? The senior developer suggested copying the price and other product information into the invoice table. I think maybe we should instead have another table, such as productPrice, with a date field, so that we can track price changes over time. I guess we would need the same for the product description and name? That seems complicated. What do you think? (A sketch of what I mean follows after question 2.)

2) The database is an accounting system, and I am not very good at accounting. Currently, some summary data is computed and stored in the database, for example total sales for the year. My senior colleague said that accountants like to check correctness by comparing that stored value with totals actually calculated from the invoices, etc., to give them confidence that the application works correctly. He said that, for example, we could currently tell if someone accidentally deleted an invoice from last year, because the totals would no longer match. He also noted that calculating these totals "on the fly" is rather slow. Of course, I said the data should not be duplicated and should always be calculated when needed. I suggested using SQL Reporting Services or another solution that generates these reports overnight and caches them. In any case, he was not convinced. Any comments on this?
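To be concrete, here is a minimal sketch of the productPrice idea from (1); the table and column names are just my guesses, not our actual schema:

    -- One row per price change per product; the key makes the history unambiguous.
    CREATE TABLE ProductPrice (
        ProductId     INT           NOT NULL,
        EffectiveDate DATE          NOT NULL,
        Price         DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (ProductId, EffectiveDate)
    );

    -- Price of product 42 as it was on a given invoice date:
    -- the most recent change on or before that date.
    SELECT Price
    FROM   ProductPrice
    WHERE  ProductId = 42
    AND    EffectiveDate = (SELECT MAX(EffectiveDate)
                            FROM   ProductPrice
                            WHERE  ProductId = 42
                            AND    EffectiveDate <= '2010-11-29');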

Thank you very much:)
Greetings
Mark

EDIT

Thanks for the great answers. It is a pity that I can only mark one of these as the answer, because there are so many good suggestions.

+41
sql database-design normalization denormalization
Nov 29 '10 at 5:30
9 answers

Your senior colleague is a developer, not a data modeller. You would be better off starting from scratch, without them. Normalisation is made complicated only by people who do not read books and get their "knowledge" from amateurs on the wiki. It is true that it makes you think, but some of the questions here are absurd.

Your numbers:

  • You need to evaluate the difference between actual online data and historical data, and then the difference between historical and archival needs. Each approach is right for one specific business requirement and wrong for all the others; there is no universal right or wrong.

    • Why is there no paper copy of the invoice? In most countries that would be a legal and tax requirement; what exactly is the difficulty of retrieving the old invoice?
    • where there is a requirement for the database to store closed invoices, then sure, as soon as the invoice is closed, you need a method of capturing that information.
    • ProductPrice (actually, I would call it ProductDate ) is a good idea, but may not be necessary. Still, you are right: you need to evaluate the currency of the data in the full context of the whole database.
    • I don't see how copying the product price into the invoice table will help (are there many line items?)
    • in modern databases where a copy of the invoice is required, the closed invoice is additionally stored in a different form, such as XML. One client stores the PDFs as blobs. So there is no problem with what the product price was five years ago. But the invoice master data is online and current, even for closed invoices; you simply cannot recalculate an ancient invoice using current prices.
    • some people use an archive_invoice table, but that has problems, because now every code segment or user report tool has to look in two places (note that these days some users understand databases better than most developers do).
    • Anyway, this is all discussion, for your understanding. None of the databases I have written in 30 years has ever had this problem, and they all met the legal and tax requirements.
      • The database serves both current and archive purposes from one set of tables (no "archive" tables)
      • Once an invoice is created, it is a legal document and cannot be changed or deleted (it can be cancelled or partially credited by a new invoice with negative values). Invoices are flagged IsIssued/IsPaid/etc.
      • Products cannot be deleted; they can be flagged IsObsolete
      • There are separate tables for InvoiceHeader and InvoiceItem
      • InvoiceItem has FKs for both InvoiceHeader and Product
      • for many reasons (not just the ones you mention), the InvoiceItem row contains NumUnits; ProductPrice; TaxAmount; ExtendedPrice. Of course this looks like "denormalisation", but it is not, because prices, tax rates, etc. can change. More importantly, the legal requirement is that we can reproduce the old invoice on demand (a sketch of such a schema follows this list).
      • (where it can be reproduced from paper files, this is not required)
      • InvoiceTotalAmount is a derived column, simply the SUM() of the InvoiceItem rows
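A minimal sketch of that shape of schema (my own names and SQL Server-flavoured types; a real schema would carry more columns and constraints):

    CREATE TABLE Product (
        ProductId  INT          NOT NULL PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL,
        IsObsolete BIT          NOT NULL DEFAULT 0  -- never deleted, only flagged
    );

    CREATE TABLE InvoiceHeader (
        InvoiceId   INT  NOT NULL PRIMARY KEY,
        InvoiceDate DATE NOT NULL,
        IsIssued    BIT  NOT NULL DEFAULT 0,  -- once issued: no change, no delete
        IsPaid      BIT  NOT NULL DEFAULT 0
    );

    CREATE TABLE InvoiceItem (
        InvoiceId     INT           NOT NULL REFERENCES InvoiceHeader (InvoiceId),
        LineNo        INT           NOT NULL,
        ProductId     INT           NOT NULL REFERENCES Product (ProductId),
        NumUnits      DECIMAL(12,3) NOT NULL,
        ProductPrice  DECIMAL(12,2) NOT NULL,  -- captured as at invoice time
        TaxAmount     DECIMAL(12,2) NOT NULL,  -- captured as at invoice time
        ExtendedPrice DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (InvoiceId, LineNo)
    );

    -- InvoiceTotalAmount is derived, never stored:
    SELECT InvoiceId, SUM(ExtendedPrice) AS InvoiceTotalAmount
    FROM   InvoiceItem
    GROUP  BY InvoiceId;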
  • This is trash. Accounting systems and accountants do not work like that.

    • If it is a true accounting system, then it will have JournalEntries, or double entry; that is what a qualified accountant is required (by law) to use.

      • Double entry does not mean a duplicated entry; it means that every financial transaction (one amount) must have a source account and a target account that it applies to; so there is no "denormalisation" or duplication. In a banking database, since financial transactions are posted against single accounts, this is usually rendered as two separate financial transaction rows within one database transaction. Ordinary commercial database constraints are used to ensure there are two "sides" to every financial transaction (a sketch appears at the end of this answer).
    • Ensuring that invoices cannot be deleted is a separate issue, relating to security etc. If someone is paranoid about things being deleted from their database, and their database was not secured by a qualified person, then they have bigger problems that have nothing to do with this question. Get a security audit, and do whatever the auditors tell you.

    • There are several people on this site who believe that a wiki is a place where you can learn something. It is not. It is a cesspool of "definitions" written by amateurs, and the "definitions" are constantly being changed by other amateurs. There is no fixed definition you can rely on. So don't worry about what the wiki says; as soon as people mention the wiki, you know their "knowledge" comes from unqualified reading, and what they read is an ever-changing cesspool. They will predictably argue about "definitions" because they have no real experience; the experienced will just get on with the job.

    • A normalized database is always much faster than a denormalized one. Therefore it is very important to understand what normalization and denormalization are, and what they are not. The process is made very difficult when people have "fluid" and "amateur" definitions; that simply leads to confusion and wasted time. When you have fixed definitions, you can avoid all of that and just get on with it.

    • Summary tables are quite normal, to save time and processing power by not recalculating information that does not change, e.g.: YTD totals for each past year, but not this year; MTD totals for each past month this year, but not this month. "Always recalculating" the data is a bit silly when (a) the information is very large and (b) it does not change. Calculate for the current month only.

      • In banking systems (millions of transactions per day), at EndOfDay we also calculate and store daily totals. These are recomputed for the last five days, because auditors make changes, and JournalEntries against financial transactions within the last 5 days are allowed.
      • non-banking systems usually do not need daily totals
    • Summary tables are not "denormalization" (except to those who have just learned about "normalization" from their magical, ever-changing fluid "source", or from non-practitioners who apply simple black-and-white rules to everything). Again, the definition is not in question here; it simply does not apply to summary tables.

    • Summary tables do not affect data integrity (assuming, of course, that the data they were derived from had integrity).

    • Summary tables are an addition to the database, and they do not need to have the same constraints as the database. They are essentially report tables or data warehouse tables, as opposed to database tables.

    • There are no Update Anomalies (which is a strict definition) associated with summary tables. You cannot change or delete an invoice from last year. Update Anomalies apply to current Denormalised or Unnormalised data.
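And to make the double-entry point concrete, the sketch referenced above (my own table and column names; a real banking schema is far richer):

    CREATE TABLE Account (
        AccountId INT NOT NULL PRIMARY KEY
    );

    -- One financial transaction = one amount, posted to two accounts
    -- (source and target) inside the same db transaction.
    CREATE TABLE JournalEntry (
        EntryId   INT           NOT NULL,
        AccountId INT           NOT NULL REFERENCES Account (AccountId),
        Amount    DECIMAL(14,2) NOT NULL,  -- negative = debit, positive = credit
        PRIMARY KEY (EntryId, AccountId)
    );

    INSERT INTO Account (AccountId) VALUES (1);
    INSERT INTO Account (AccountId) VALUES (2);

    -- Transfer 100.00 from account 1 to account 2: two rows, no duplication.
    BEGIN TRANSACTION;
    INSERT INTO JournalEntry (EntryId, AccountId, Amount) VALUES (1001, 1, -100.00);
    INSERT INTO JournalEntry (EntryId, AccountId, Amount) VALUES (1001, 2, 100.00);
    COMMIT;

    -- The cross-check the accountants want: every entry must net to zero.
    SELECT EntryId
    FROM   JournalEntry
    GROUP  BY EntryId
    HAVING SUM(Amount) <> 0;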

+41
Nov 30 '10 at 2:52

1) This is an archive. Nothing in it should ever be updated. I would go with the senior developer's suggestion and make the invoice table self-contained. Perhaps use a blob for the invoice itself, containing the markup?

2) Reporting services, a summary table updated by a trigger, something you build with a scheduled script ... any of these will be fine, I think. Normalizing is the ideal, but it is not always fast. I manage a large healthcare database that is fully normalized ... and then has a series of denormalized tables with flattened calculations and commonly requested fields. Almost everything reads from that denormalized set; it is simpler to update it with a trigger when files are loaded than to constantly join across many tables every time I want a report over 100,000 records.
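A minimal sketch of the trigger-maintained summary idea, in SQL Server syntax (names invented; it assumes InvoiceHeader/InvoiceItem tables exist and that a SalesSummary row already exists for each year):

    CREATE TABLE SalesSummary (
        SalesYear  INT           NOT NULL PRIMARY KEY,
        TotalSales DECIMAL(14,2) NOT NULL DEFAULT 0
    );
    GO
    CREATE TRIGGER trg_InvoiceItem_Summary
    ON InvoiceItem
    AFTER INSERT
    AS
    BEGIN
        -- Roll the newly inserted line amounts into the matching year's total.
        UPDATE s
        SET    s.TotalSales = s.TotalSales + d.Added
        FROM   SalesSummary AS s
        JOIN  (SELECT YEAR(h.InvoiceDate) AS SalesYear,
                      SUM(i.ExtendedPrice) AS Added
               FROM   inserted AS i
               JOIN   InvoiceHeader AS h ON h.InvoiceId = i.InvoiceId
               GROUP  BY YEAR(h.InvoiceDate)) AS d
          ON   d.SalesYear = s.SalesYear;
    END;

The accountants can then compare SalesSummary.TotalSales against SUM(ExtendedPrice) recomputed from the invoices themselves; a mismatch flags a deleted or mangled invoice.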

+7
Nov 29 '10 at 5:39

You raise valid points; however, you may not have fully understood normalization and what it means. For example:

1) The claim that keeping invoice history denormalizes the data is completely and utterly wrong. Take the price, for example: if you have a business requirement that says you must keep the price history, then keeping only the current price is simply incorrect, and it violates the requirements. That has nothing to do with normalization; it is just bad design. Denormalization is the deliberate introduction of redundancy into your model (with the anomalies that brings); in this case you would simply not be modelling your problem space properly. There is nothing wrong with modelling a database to support temporal data (or versioning, and/or separating the database into archive/temporal and working sets).

It is impossible to look at normalization without looking at the semantics (in terms of the requirements).

Also, if your senior developer cannot see that distinction, I would guess he did not earn his seniority doing RDBMS development ;)

2) The second part really is denormalization. However, if you ever come across a senior DB analyst who seriously preaches normalization, you will hear him say that it is quite acceptable to denormalize, as long as you do it consciously and make sure that the gains outweigh the drawbacks and that the anomalies do not bite you. He will also tell you to normalize the logical model, and that in the physical model you are allowed to deviate from the ideal for various purposes (performance, maintenance, etc.). In my book, the main purpose of normalization is that you have no hidden anomalies (see this article on 5NF).

Caching intermediate results is allowed even in normalized databases, and even by the greatest normalization evangelists - you can do it at the application level (as some kind of cache), or at the database level, or you can use a data warehouse for such purposes. These are all valid choices, and they have nothing to do with normalization of the logical model.

Also, as for your accountant: you should be able to convince him that what he proposes is not a good test, and develop a set of tests (maybe with him) that automate system testing without user intervention and give you greater confidence that your system works correctly.

On the other hand, I know of systems that require users to enter duplicate information, for example the number of lines on an invoice, before or after entering the actual lines, to ensure that the entry is complete. That data is "duplicated", and you would not need to store it if you had a procedure that checks the input at entry time. If that check happens later, it is permissible to store the "denormalized" data - again, the semantics justify it, and you can consider the model normalized. (It is useful to wrap your head around this concept.)
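As a sketch of that kind of completeness check (names invented; assume the header stores a user-entered EnteredLineCount), the later verification could be as simple as:

    -- Invoices whose actual line count disagrees with the control total
    -- the user typed in.
    SELECT h.InvoiceId
    FROM   InvoiceHeader AS h
    LEFT JOIN InvoiceItem AS i ON i.InvoiceId = h.InvoiceId
    GROUP  BY h.InvoiceId, h.EnteredLineCount
    HAVING COUNT(i.InvoiceId) <> h.EnteredLineCount;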

EDIT: The term "denormalized" in (2) is incorrect if you look at the formal definitions of the normal forms and consider a design denormalized only if it violates one of the normal forms (for some people this is obvious, and there is no other way to see it).

However, you might get used to the idea that many people, and many otherwise useful texts, use the term "denormalization" for any effort that tries to reduce redundancy in the database (as an example, you will find academic papers - which I am not saying must be right, just a warning that this is a common occurrence - that treat derived attributes as a form of denormalization; see here).

If you want to appeal to a more coherent and recognized authority (again, not recognized by everyone), perhaps the words of C. J. Date can draw a clear distinction:

Much of design theory has to do with reducing redundancy; normalization reduces redundancy within relvars, orthogonality reduces it across relvars.

quoted from Database in Depth: Relational Theory for Practitioners

and on the next page

just as a failure to normalize all the way implies redundancy and can lead to certain anomalies, so too does a failure to adhere to orthogonality.

So the correct term for redundancy across relvars is orthogonality (basically, all the normal forms talk about a single relvar, so if you look strictly at normalization, it will never suggest any improvements that depend on relationships between two different relvars).

In any case, another important concept when considering database design is the distinction between the logical and the physical database model. Many things that are useful at the physical level, such as tables with subtotals or indexes, have no place in the logical model - where you are trying to establish and explore the relationships between the concepts you are modelling. And that is why you can say they are valid and do not break the design.

The lines between the logical model and the physical model can sometimes be a little blurry. A particularly good example is a table with subtotals. To consider it part of the physical implementation and ignore it at the logical level, you must:

  • ensure that users (and applications) cannot update the subtotal table directly with values that disagree with its predicate (in other words, that there is no bug in the updating procedure)
  • make sure that users (and applications) cannot update the tables it depends on without updating the subtotal (in other words, that no application will delete a row from the detail table without updating the total)

If you violate either of the above rules, you will end up with an inconsistent database that provides conflicting facts. (In that case, if you wanted to formally design a procedure for fixing or studying the problems it causes, you would no longer consider it just an additional table; it would exist at the logical level, where it should not be.)

Also, normalization always depends on the semantics and business rules you are trying to model. For example, PerformanceDBA gives an example in which storing TaxAmount in the transaction table is not a denormalized design, but he does not mention that it depends on which system you are trying to model (is that obvious?); for example, if the transaction row also had an attribute named TaxRate, it would normally be denormalized, because there is a functional dependency on a set of non-key attributes (TaxAmount = Amount * TaxRate => FD: Amount, TaxRate → TaxAmount), and one of them should be removed, or it must be guaranteed that they stay consistent (a sketch follows at the end of this answer).

Obviously, you could object: but if the system you are building is for an auditing company, then you might not have that functional dependency - they may be auditing someone who uses manual calculations, or has faulty software, or they must be able to record incomplete data, and the calculation may have been wrong in the first place; as an auditing company you must record the fact of how it happened.

Thus, the semantics (predicates) defined by the requirements affect whether any of the normal forms is violated - by shaping the functional dependencies (in other words, correctly establishing the functional dependencies is a very important part of modelling when you are striving for a normalized database).
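The sketch promised above (SQL Server computed-column syntax; the names are mine): storing all three of Amount, TaxRate and TaxAmount invites the anomaly, whereas deriving one of them removes it.

    -- FD: (Amount, TaxRate) -> TaxAmount. If all three were stored
    -- independently, an update to Amount or TaxRate could leave
    -- TaxAmount inconsistent.
    CREATE TABLE Txn (
        TxnId   INT           NOT NULL PRIMARY KEY,
        Amount  DECIMAL(12,2) NOT NULL,
        TaxRate DECIMAL(5,4)  NOT NULL,
        -- Derived, so it is consistent by construction:
        TaxAmount AS (Amount * TaxRate)
    );

For the auditing-company scenario, you would instead store all three as independently recorded facts, precisely because the functional dependency does not hold there.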

+5
Nov 29 '10

I agree with your senior about (1). The transaction table row should record the complete state at the time of the transaction. Period. What you propose does not record the actual data, so it is not valid. I also agree on (2). Whatever cross-validation the business wants, you must implement. Accounting is built on cross-validation: double entry, rolling up ledgers, etc. You have to do it. It is so fundamental that you should not even think of it as denormalization, just as fulfilling business requirements.

+3
Nov 29 '10 at 5:48

1) Does not require denormalization. You just need to determine what level of detail you need for each change, and store it with an appropriate key.

2) Has nothing to do with denormalization. Storing summary data does not denormalize the database. Storing results derived from non-key attributes in the same table would be an example of denormalization, but that is not what you are describing here.

+3
Nov 29 '10 at 10:32

Your senior developer makes extremely valid points. I learned this myself by maintaining systems that keep historical data intact.

In a sense, this does not even add overhead to the database: you create the invoice rows from data that already exists in the database. An invoice is a snapshot.

See: http://jlrand.com/?p=95

+3
01 . '11 18:13

# 1

Prices, names, descriptions, etc. change over time, so capture them with the invoice.

# 2

Storing summary data in the db is a business decision; discuss it with the CFO and the DBA.

If you use SQL Server, take a look at the Adventure Works db from MS; Adventure Works is a good example of this.

+1
Nov 30 '10 at 0:02

