When should you store pre-calculated values, and when should you calculate them on retrieval?

I have a dilemma. I work with a lot of legacy code, and I see a lot of redundant information in the table structures. It comes in two forms:

a. Redundant information to save on joins. E.g.:

    event_id  event_name  event_creator_id
    3         test1       43

    subevent_id  event_id  event_creator_id
    21           3         43

Note the duplicated event_creator_id. The rationale the early developers gave is that when we need the event creator's identifier, we only have to query one table instead of doing an "expensive" join to get the value.
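For reference, here is a sketch of the normalized version of this example (the table and column names come from above; the exact types and the query are my assumptions):

    CREATE TABLE event (
        event_id         INT PRIMARY KEY,
        event_name       VARCHAR(100),
        event_creator_id INT
    );

    CREATE TABLE subevent (
        subevent_id INT PRIMARY KEY,
        event_id    INT REFERENCES event(event_id)
        -- no duplicated event_creator_id here
    );

    -- Fetching the creator for a subevent is one indexed join:
    SELECT e.event_creator_id
    FROM subevent s
    JOIN event e ON e.event_id = s.event_id
    WHERE s.subevent_id = 21;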

b. Redundant information to save on calculations. E.g.:

    event_id  event_default_price
    3         100

    discount_id  discount_code  discount_percentage
    7            ABCD           50

    special_event_id  event_id  discount_id  discounted_price
    21                3         7            50

Note that instead of calculating the final discounted_price for this special event whenever it is needed (the link to discount_id is already there), the code stores the "calculated" value as-is. Again, the excuse is "speed", the usual road to hell.
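For comparison, this is roughly what computing the price on retrieval would look like against the tables above (a sketch; the join and the percentage arithmetic are assumptions based on the example):

    -- Derive discounted_price at query time instead of storing it.
    SELECT se.special_event_id,
           e.event_default_price * (1 - d.discount_percentage / 100.0)
               AS discounted_price
    FROM special_event se
    JOIN event    e ON e.event_id    = se.event_id
    JOIN discount d ON d.discount_id = se.discount_id
    WHERE se.special_event_id = 21;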

I have two questions:

  • I can tell new developers that these structures are not normalized, but they can reply that it is faster. How do I counter that? Should I counter it at all? Is this how other people design their databases?!
  • Is there a rule of thumb or a set of principles I can use to argue, for example, "yes, it will be slower, but only by 1%, so it is okay to do it this way"?
2 answers

About your two questions:

I can tell new developers that these structures are not normalized, but they can reply that it is faster. How do I counter that? Should I counter it at all? Is this how other people design their databases?!

It may be faster, but it does not have to be: when you decide to add extra information to a table (extra fields, in your case), you also add a performance penalty, because the table becomes larger. That can mean more data moving from the server to the clients, and more data being paged into and out of memory. Also, a field added to speed up queries is likely to get one or more indexes on it, which again costs something on every update and insert.

The main point, however, is what I hinted at in my comment: "cached" and "pre-computed" values make the system more fragile in terms of data integrity. Are you sure that event_creator_id always points to the real creator, even after someone has changed the original value? If it does, that guarantee also has a cost, both computationally (you have to update all the tables whenever the creator changes) and in actual development and testing effort (are you sure nobody forgot to propagate the change to the pre-calculated fields?).
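To make the fragility concrete, here is a sketch of what a creator change looks like against the denormalized schema from the question (the statements and values are hypothetical):

    -- With the duplicated column, every change has to be propagated by hand.
    UPDATE event
    SET event_creator_id = 99
    WHERE event_id = 3;

    -- Forget this second statement and subevent.event_creator_id silently
    -- keeps pointing at the old creator (43):
    UPDATE subevent
    SET event_creator_id = 99
    WHERE event_id = 3;

In the normalized schema there is simply nothing to forget.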

The same applies to aggregate values such as the "discounted price" or running totals, and the underlying data probably changes much more often than the "event creator" information does. Again, is there a proper cache-invalidation mechanism to ensure that total sales are recomputed whenever someone completes a sale? What about a returned item? Has anyone priced out what it takes to guarantee that integrity?

Running totals and other derived values should be implemented using views instead, so that any caching, if it happens at all, is done by the DBMS engine itself, which knows how to take care of it correctly.
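A minimal sketch of such a view for the discount example from the question (the names are carried over from there; the rest is an assumption):

    -- Callers read discounted_price as if it were stored, but it is always
    -- computed from the source rows, so it can never go stale.
    CREATE VIEW special_event_pricing AS
    SELECT se.special_event_id,
           se.event_id,
           se.discount_id,
           e.event_default_price * (1 - d.discount_percentage / 100.0)
               AS discounted_price
    FROM special_event se
    JOIN event    e ON e.event_id    = se.event_id
    JOIN discount d ON d.discount_id = se.discount_id;

    SELECT discounted_price
    FROM special_event_pricing
    WHERE special_event_id = 21;

Several DBMSs also offer materialized or indexed views, which cache the result set and let the engine (or a scheduled refresh) keep it current, giving much of the speed of a stored value without hand-rolled invalidation code.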

Is there a rule of thumb or a set of principles I can use to argue, for example, "yes, it will be slower, but only by 1%, so it is okay to do it this way"?

A database (or probably any computing system) should be "correct first", and only then should you work out how to make it "fast enough, second". Trading correctness for speed is a decision you should not make while designing the database unless you already know that timeliness is considered more important than correctness, that is, unless your requirements clearly state that serving erroneous or stale information is less of a problem than response time.

In other words: designing a table with redundant cached information is yet another example of premature optimization, and should be avoided at all costs.

See also this - especially the answers.


Every database book I have read on relational design includes a section on "planned" redundancy or "limited" denormalization. It depends on the environment. Wells Fargo, for instance, pre-calculates bank statement totals and stores the pre-calculated values.

Imagine how long it would take to perform those calculations if they waited until the end of each cycle, when the statements go to print.
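For what it is worth, the usual way to keep that kind of planned redundancy under control is a summary table rebuilt by one well-known batch job, rather than updated ad hoc from application code. A rough sketch, with entirely assumed table and column names:

    -- Hypothetical summary table holding one total per account per cycle.
    CREATE TABLE statement_totals (
        account_id   INT NOT NULL,
        cycle_end    DATE NOT NULL,
        total_amount NUMERIC(12, 2) NOT NULL,
        PRIMARY KEY (account_id, cycle_end)
    );

    -- The scheduled refresh is the single place where the redundancy is created.
    INSERT INTO statement_totals (account_id, cycle_end, total_amount)
    SELECT account_id, CURRENT_DATE, SUM(amount)
    FROM transactions
    WHERE posted_at < CURRENT_DATE
    GROUP BY account_id;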

Planned redundancy is normal!

