When is normalization really necessary?

To illustrate my question, consider the following relation:

Person( name, street, city, zipcode )
name -> street, city, zipcode
street + city -> zipcode

So, if we know the name, we also know where the person lives. But zipcode also depends (transitively) on street + city. Thus, this relation violates 3NF and should be split into two tables to comply.
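To make the 3NF argument concrete, here is a minimal sketch that computes attribute closures for the two dependencies above. It assumes these are the only dependencies, so that `name` is the only candidate key; the helper names `closure` and `is_superkey` are made up for this illustration.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply a dependency once its whole left side is known.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


ALL_ATTRS = {"name", "street", "city", "zipcode"}
FDS = [
    (frozenset({"name"}), frozenset({"street", "city", "zipcode"})),
    (frozenset({"street", "city"}), frozenset({"zipcode"})),
]


def is_superkey(attrs):
    """A superkey determines every attribute of the relation."""
    return closure(attrs, FDS) == ALL_ATTRS


# name determines everything, so it is a key.
name_is_key = is_superkey({"name"})

# {street, city} is not a superkey and zipcode is not part of any key
# (assumption: name is the only candidate key), so
# street + city -> zipcode is the dependency that breaks 3NF.
violates_3nf = not is_superkey({"street", "city"})
```

If street + city -> zipcode is dropped from `FDS`, nothing is left to violate 3NF, which is really what the question comes down to: is that dependency a rule the database should enforce?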

But in this case, we are not interested in zipcodes as a separate entity. The zipcode is part of the address, and it is merely a transitive dependency. We will never use it on its own.

I understand why normalization is good. But is it really necessary to always normalize (and thereby make the database more complex)? If not, how do you know when you can skip it?

(If my terminology or notation is incorrect, feel free to correct me.)

+7
5 answers

Normalization is a tool for analyzing dependencies and for ensuring that data integrity rules (business rules) expressed as dependencies are implemented correctly. The fundamental assumption of normalization is that you know, or can determine, which business rules you actually want to enforce. If you are already sure that you do not want or need to enforce a particular business rule, then it probably makes little sense to treat it as a dependency when designing the database. Remember that the point of a dependency is that the rule always holds for all possible data in the database, not just for the current data or some specific subset of it.

It is possible that the {street, city} → {zipcode} dependency is not a desired business rule for the system and therefore should not be enforced. For example, if data must be entered without address verification software, it may not be practical to guarantee that zipcodes comply with it. This does not mean that you are breaking any normalization rule. It simply means that the functional dependency is not intended to hold and is not enforced, and therefore there is no transitive dependency in any real sense.

+3

Besides performance, another reason not to normalize fully may be a certain "fuzziness" in your data.

As far as I understand [1], a ZIP can be specific to a city block or region, which means that a particularly long street can have more than one ZIP. And even if a ZIP really does correspond to city + street in the USA, this may not be true for postal codes in other countries, should you ever decide to go international.

But even assuming that ZIP codes really are determined by city + street, people probably enter the address information themselves, which means that they can get it wrong, including the ZIP. Thus, you can end up with two different ZIPs for the same combination of city and street.

A fully normalized database simply cannot represent this: you would have to somehow pick one of the ZIPs. If you do not have access to a complete, up-to-date database of all ZIP codes, you have no good way to resolve the conflict. If you end up picking the wrong ZIP, all persons on the same city + street will have the wrong ZIP.

On the other hand, a denormalized database lets each person store their own ZIP, so a wrong ZIP affects only that person and nobody else. You can even implement an autocomplete suggestion and an "are you sure?" warning if the user enters a ZIP that differs from the one already stored for an existing city + street, but then let them continue if they confirm that they are sure.
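The "are you sure?" idea could look roughly like this. It is only a sketch using Python's built-in `sqlite3` module; the table layout and the helpers `conflicting_zips` and `add_person` are hypothetical names made up for the example, and the addresses are invented sample data.

```python
import sqlite3

# Denormalized on purpose: every person carries their own ZIP.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "name TEXT PRIMARY KEY, street TEXT, city TEXT, zipcode TEXT)"
)


def conflicting_zips(conn, street, city, zipcode):
    """Return ZIPs already stored for this street + city that differ from zipcode."""
    rows = conn.execute(
        "SELECT DISTINCT zipcode FROM person "
        "WHERE street = ? AND city = ? AND zipcode <> ? ORDER BY zipcode",
        (street, city, zipcode),
    )
    return [row[0] for row in rows]


def add_person(conn, name, street, city, zipcode, confirmed=False):
    """Insert the person; return conflicting ZIPs instead if not confirmed."""
    conflicts = conflicting_zips(conn, street, city, zipcode)
    if conflicts and not confirmed:
        # The caller would show the "are you sure?" warning here.
        return conflicts
    conn.execute(
        "INSERT INTO person VALUES (?, ?, ?, ?)", (name, street, city, zipcode)
    )
    return []


first = add_person(conn, "Alice", "Main St", "Springfield", "11111")
warning = add_person(conn, "Bob", "Main St", "Springfield", "22222")
forced = add_person(conn, "Bob", "Main St", "Springfield", "22222", confirmed=True)
```

Here `warning` holds the ZIP that is already on file, and Bob is only stored once the insert is repeated with `confirmed=True`. A wrong ZIP then stays confined to Bob's row.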


[1] And I do not live in the USA, so I could be off.

+4

The costs and benefits of normalization depend on the case, and mainly on what you are going to do with the data.

There are (at least) two fundamentally different ways of using data. One of them is On Line Transaction Processing (OLTP). The other is On Line Analytical Processing (OLAP).

In OLTP, the cost of not normalizing can be quite high. Transactions become more complex and slower, and bottlenecks degrade performance. In OLAP, the benefits of normalization are limited, and there are other design disciplines that can do more for the same effort. One of these alternatives is the star schema, which you can look up.

But the question is not so much whether to normalize or not as which design discipline to follow, even if that discipline does not lead to a normalized database.

Returning to the specific case that you have outlined, there are many systems with a heavy transactional load from customer activity in which the customer table is only read, never written, by those transactions.

Failure to comply with 3NF will only hurt you when you need to enter a new customer and have to enter the zip code again even though there are already other customers with the same city, street and zip code. And in the event that the post office reassigns the zip code for that street, you will have to update many addresses instead of one row in a normalized table.

This is not a very expensive event, nor a very likely one.

On the other hand, how likely is it for a post office to take one street and divide that street between two postal codes, depending on which block the address is on? If this last event occurs, you are really better off with a structure that violates 3NF. You can enter different postal codes for each address using information provided by the post office about the split.

So how likely is this second scenario? I think it is more likely than the first. But you need to go with your hunch, not mine.

+1

I am not American, so I hesitate to say this, but I do not think that you understand zip codes. Some individual buildings have their own zip code. Zip codes can cross state boundaries. A zip code can represent a PO Box without any geographic significance.

So, regardless of the merits of normalization, your example is a bad choice. There is no clear correspondence between (street, city) and zip code.

I may be wrong about the above, but I do know that a street in the UK can have more than one postcode (even a fairly short street).

+1

If {street, city} → {zipcode} holds, then this constraint must be declared to the DBMS so that the DBMS can enforce it. Otherwise, you will soon end up with data that looks like this.

 name          street            city            zipcode
 --
 Barack Obama  Pennsylvania Ave  Washington, DC  90210

90210 is a real ZIP code, but it belongs to Beverly Hills, California.

It is a rare application that can really tolerate data this bad.
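One way to declare the dependency to the DBMS is to split the relation and make (street, city) a key of its own table. The sketch below uses SQLite through Python's `sqlite3` module; the table name `address_zip`, the helper `try_insert_zip`, and the sample ZIP 20500 are assumptions chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE address_zip (
    street  TEXT NOT NULL,
    city    TEXT NOT NULL,
    zipcode TEXT NOT NULL,
    PRIMARY KEY (street, city)   -- at most one zipcode per street + city
);
CREATE TABLE person (
    name   TEXT PRIMARY KEY,
    street TEXT NOT NULL,
    city   TEXT NOT NULL,
    FOREIGN KEY (street, city) REFERENCES address_zip (street, city)
);
""")


def try_insert_zip(street, city, zipcode):
    """Return True if the row was accepted, False if the DBMS rejected it."""
    try:
        conn.execute(
            "INSERT INTO address_zip VALUES (?, ?, ?)", (street, city, zipcode)
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert_zip("Pennsylvania Ave", "Washington, DC", "20500")
rejected = not try_insert_zip("Pennsylvania Ave", "Washington, DC", "90210")
```

Note that this only guarantees one zipcode per street + city. It does not check that the stored zipcode is the correct one; that would still require a trusted source of postal data.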

0
