Automatic normalization of a MySQL database: how to do it?

I have a MySQL database that consists of one huge table with 80 columns and 10 million rows. The data may contain inconsistencies.

I would like to normalize the database in an automatic and efficient way.

I could do this in Java, C++, or something similar, but I would like to do as much as possible inside the database. I assume that any work done outside the database would slow the process down considerably.

Any suggestions on how to do this? What are some good resources or tutorials to start with?

I'm not looking for an explanation of what normalization is (I found plenty about that using Google)!

+4
3 answers

When cleaning up messy data, I like to create custom MySQL functions for typical data cleanup tasks, so that they can be reused later. If you take this approach, you can also check whether there is an existing UDF that you can use (with or without modification), for example at mysqludf.org
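
As a minimal sketch of what such a reusable cleanup function could look like: the function name clean_phone, the phone column, and the list of characters to strip are all assumptions made for illustration, not something taken from the question. It trims the value, removes common punctuation, and turns empty strings into NULL.

```sql
-- Hypothetical helper: normalize a free-text phone number.
-- Nested REPLACE is used so the sketch does not depend on REGEXP_REPLACE,
-- which is only available in newer MySQL versions.
CREATE FUNCTION clean_phone(raw VARCHAR(64))
RETURNS VARCHAR(64)
DETERMINISTIC
RETURN NULLIF(
         REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
           TRIM(raw), ' ', ''), '-', ''), '(', ''), ')', ''), '.', ''),
         '');

-- Example use against the big table (column name assumed):
UPDATE massive_table
SET phone = clean_phone(phone);
```

Because the body is a single RETURN statement, no DELIMITER change is needed in the mysql client. On 10 million rows it is probably worth trying the UPDATE on a copy or a subset first.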

+2

I can't think of any way to automate it. You will have to create the tables you need and then move each piece of data across with hand-written queries.

e.g.

INSERT INTO contact (first_name, last_name, phone) SELECT DISTINCT first_name, last_name, phone FROM massive_table;

Then you can drop those columns from the massive table and replace them with a contact_id column.

You would follow a similar process when pulling out rows that belong in a one-to-many table.
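
The whole sequence for one extracted entity might look roughly like the sketch below. The column types, the contact_id definition, and the index name are assumptions; only the table and column names come from the query above.

```sql
-- 1. Target table with a surrogate key (types are assumed).
CREATE TABLE contact (
  contact_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  first_name VARCHAR(100),
  last_name  VARCHAR(100),
  phone      VARCHAR(64),
  KEY idx_contact_lookup (first_name, last_name, phone)
);

-- 2. One row per distinct combination of values.
INSERT INTO contact (first_name, last_name, phone)
SELECT DISTINCT first_name, last_name, phone
FROM massive_table;

-- 3. Add the foreign key column and fill it in.
ALTER TABLE massive_table ADD COLUMN contact_id INT UNSIGNED NULL;

-- <=> is MySQL's NULL-safe equality, so rows containing NULLs still match.
UPDATE massive_table m
JOIN contact c
  ON  c.first_name <=> m.first_name
  AND c.last_name  <=> m.last_name
  AND c.phone      <=> m.phone
SET m.contact_id = c.contact_id;

-- 4. Sanity check: this should return 0 before anything is dropped.
SELECT COUNT(*) AS unmatched
FROM massive_table
WHERE contact_id IS NULL;

-- 5. Only then remove the now-redundant columns.
ALTER TABLE massive_table
  DROP COLUMN first_name,
  DROP COLUMN last_name,
  DROP COLUMN phone;
```

Each ALTER TABLE on a table this size may take a long time, so it is safer to rehearse the steps on a copy first.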

+4

You need to examine the columns to identify groups that describe the same entity and break them out into separate tables. At best, an automated tool can identify groups of rows that have the same values in some columns, but a person who understands the data will have to decide whether those columns really belong to a separate entity.

Here's a contrived example: suppose your columns were first name, last name, address, city, state, zip. An automated tool might find rows for members of the same family, who share the same last name, address, city, state, and zip, and incorrectly conclude that these five columns represent an entity. It might then split the table into:

first name, ReferenceID

and another table

ID, last name, address, city, state, zip code

See what I mean?
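
The mechanical part, finding out whether one column appears to determine others, can at least be checked with a query. This is only a sketch; the column names zip, city, and state are assumed to exist in massive_table, as in the example above.

```sql
-- List every zip that maps to more than one (city, state) combination.
-- COUNT(DISTINCT a, b) is a MySQL extension; it ignores rows where
-- city or state is NULL.
SELECT zip,
       COUNT(DISTINCT city, state) AS variants
FROM massive_table
GROUP BY zip
HAVING variants > 1;
```

An empty result only says the current data is consistent with zip determining city and state. It does not say that the split is meaningful, which is the judgment a person still has to make.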

+3