How to design SQL tables when a column's data comes in several types / varying margins of error?

I was given a pile of data in which a certain value was sometimes collected as a full date (YYYY-MM-DD) and sometimes as just a year.

Depending on how you look at it, this is either a variation in type or a variation in margin of error.

This is a sub-standard situation, but I cannot afford to recover or discard any data.

What is the optimal (i.e. the least worst :)) SQL table design that accepts either form, while avoiding monstrous queries and allowing maximum use of database features such as constraints and keys*?

* i.e. Entity-Attribute-Value is out.

+4
10 answers

+1 each to the recommendations from ninesided, Nikki9696, and Jeff Siver. I support all of those answers, although none of them is exactly what I decided on.

My solution:

  • a date column used for full dates only
  • an int column used for years
  • a constraint to ensure integrity between the two
  • a trigger to populate the year if only the date is provided

Benefits:

  • can run simple (single-table) queries on the date column with missing data ignored (using NULL for what it was designed for)
  • can run simple (single-table) queries on the year column for any row with a date (since the year is filled in automatically)
  • can insert the year only, the date only, or both (provided they agree)
  • no fear of disagreement between the columns
  • self-explanatory, intuitive

I would argue that methods using YYYY-01-01 to signify missing data (flagged as such by a second explanatory column) fail on points 1 and 5.

Sample code for Sqlite 3:

    -- event_year and event_date must agree whenever both are present
    create table events (
        rowid      integer primary key,
        event_year integer,
        event_date date,
        check (event_year = cast(strftime('%Y', event_date) as integer))
    );

    -- fill in the year when only the date was supplied
    create trigger year_trigger after insert on events
    begin
        update events
           set event_year = cast(strftime('%Y', event_date) as integer)
         where rowid = new.rowid and event_date is not null;
    end;

    -- various methods to insert
    insert into events (event_year, event_date) values (2008, '2008-02-23');
    insert into events (event_year) values (2009);
    insert into events (event_date) values ('2010-01-19');

    -- select events in January without expressions on supplementary columns
    select rowid, event_date from events where strftime('%m', event_date) = '01';
0

You can store the year, month, and day components in separate columns. That way you only need to fill in the columns for which you have data.
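As a rough sketch of the separate-columns idea, the following uses Python's built-in sqlite3 module. The table and column names (events, event_year, event_month, event_day) and the CHECK constraints are illustrative assumptions, not part of the suggestion itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id          INTEGER PRIMARY KEY,
        event_year  INTEGER NOT NULL,
        event_month INTEGER CHECK (event_month BETWEEN 1 AND 12),
        event_day   INTEGER CHECK (event_day BETWEEN 1 AND 31),
        -- assumption: a day without a month is rejected
        CHECK (event_day IS NULL OR event_month IS NOT NULL)
    )
""")

# One full date and one year-only value; unknown parts stay NULL.
conn.executemany(
    "INSERT INTO events (event_year, event_month, event_day) VALUES (?, ?, ?)",
    [(2008, 2, 23), (2009, None, None)],
)

# Rows that carry a complete date.
full = conn.execute(
    "SELECT event_year, event_month, event_day FROM events "
    "WHERE event_month IS NOT NULL AND event_day IS NOT NULL"
).fetchall()

# Year queries work for every row, whatever its precision.
in_2009 = conn.execute(
    "SELECT COUNT(*) FROM events WHERE event_year = 2009"
).fetchone()[0]
```

The trade-off is that range comparisons and date functions need the three parts reassembled, since no single column holds a date value.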

+5

If it comes in as just a year, default the month and day to 01, giving YYYY-01-01.

That way you can still use the date / datetime data type and not have to worry about invalid dates

+2

Either import it as a string, unchanged, and make it consistent in a separate step, or convert the year-only values during import, as SQLMenace recommends.

+1

I would store the value in a DATETIME column, together with a second value (an integer will do, or some kind of enumerated type) that indicates its precision.

It would be easier to give more specific advice if you mentioned what queries you will run against the data.
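A minimal sketch of the date-plus-precision idea, using Python's built-in sqlite3 module. SQLite has no dedicated DATETIME storage type, so this assumes ISO-8601 text; the names events, event_date, date_precision, and to_row are hypothetical, and year-only values are padded to YYYY-01-01 purely as a storage convention.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id             INTEGER PRIMARY KEY,
        event_date     TEXT NOT NULL,
        date_precision TEXT NOT NULL
                       CHECK (date_precision IN ('year', 'day'))
    )
""")

def to_row(raw):
    """Map a supplied value to (event_date, date_precision)."""
    if len(raw) == 4 and raw.isdigit():
        # Year only: pad to the first of January and flag it as imprecise.
        return (f"{raw}-01-01", "year")
    # Full date: fromisoformat raises ValueError on malformed input.
    return (date.fromisoformat(raw).isoformat(), "day")

conn.executemany(
    "INSERT INTO events (event_date, date_precision) VALUES (?, ?)",
    [to_row(raw) for raw in ("2008-02-23", "2009", "2010-01-19")],
)

# Events in January, excluding padded year-only rows.
january_full_dates = conn.execute(
    "SELECT event_date FROM events "
    "WHERE date_precision = 'day' AND strftime('%m', event_date) = '01'"
).fetchall()
```

Every query that cares about the month or day has to filter on date_precision as well, which is the cost of keeping a single date column.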

+1

  • Either fix it, then store it (OK, not an option)
  • Or store it broken, with a fixed computed column

Something like this:

    CREATE TABLE ...
        ...
        Broken varchar(20),
        Fixed AS CAST(
            CASE
                WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101'
                ELSE Broken
            END AS datetime)

It also lets you tell good source data from bad.

+1

If you do not always have a full date, what keys and constraints do you need? You could store two data columns: a full date and a year. For data that only has a year, the year is stored and the date is NULL. For items with complete information, both are populated.

+1

I would put three columns in a table:

  • The value as provided (YYYY-MM-DD or YYYY)
  • A date column, of a Date or DateTime data type, that is nullable
  • A year column, as an integer or char(4) depending on your needs

I would always populate the year column, and populate the date column only when the provided value is a full date.

And since you have kept the value as provided, you can always reprocess it down the road if you need to change things.
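The three-column layout could be sketched as follows with Python's built-in sqlite3 module. The names events, raw_value, event_date, event_year, and insert_raw are assumptions made for illustration, and the parsing rule (four digits means a year, anything else must be an ISO date) is a simplification.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        raw_value  TEXT NOT NULL,     -- exactly what was supplied
        event_date TEXT,              -- NULL when only a year was supplied
        event_year INTEGER NOT NULL   -- always populated
    )
""")

def insert_raw(raw):
    """Store the supplied value plus the columns derived from it."""
    if len(raw) == 4 and raw.isdigit():
        parsed_date, year = None, int(raw)
    else:
        d = date.fromisoformat(raw)   # raises ValueError on anything else
        parsed_date, year = d.isoformat(), d.year
    conn.execute(
        "INSERT INTO events (raw_value, event_date, event_year) "
        "VALUES (?, ?, ?)",
        (raw, parsed_date, year),
    )

for raw in ("2008-02-23", "2009"):
    insert_raw(raw)

rows = conn.execute(
    "SELECT raw_value, event_date, event_year FROM events ORDER BY id"
).fetchall()
```

Because raw_value is never modified, the derived columns can be rebuilt later with a different parsing rule without going back to the original source.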

+1

An alternative solution would be a date mask (like a netmask in IP). Store the date in a regular datetime field and add an extra field of type smallint or similar in which you indicate which parts are present (it could even be binary):

If you have YYYY-MM-DD, you will have 3 bits of data, each set to 1 if that part is present and 0 if not.

Example:

    Date        Mask
    2009-12-05  7 (111)
    2009-12-01  6 (110, only the year and month are known, and the day is set to the default of 1)
    2009-01-20  5 (101, for some strange reason only the year and the day are known; January has 31 days, so it will never generate an error)

Which solution is best depends on what you will do with it.

This one is better if you want to select rows that have full dates falling within a certain period (less to write). It also makes it easier to compare any dates that have masks like 7, 6, or 4. It may also take up less space (date + smallint can be smaller than int + int + int; only if datetime uses 64 bits and smallint uses as much as an int will it be the same).
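Here is a small sketch of the mask idea using Python's built-in sqlite3 module. The bit assignment (4 = year, 2 = month, 1 = day) follows the example table, while the names events, event_date, and mask, the TEXT date storage, and the extra year-only row are assumptions added for illustration.

```python
import sqlite3

YEAR, MONTH, DAY = 4, 2, 1  # bit values, matching masks such as 7 (111)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        event_date TEXT NOT NULL,      -- unknown parts defaulted to 01
        mask       INTEGER NOT NULL CHECK (mask BETWEEN 0 AND 7)
    )
""")
conn.executemany(
    "INSERT INTO events (event_date, mask) VALUES (?, ?)",
    [
        ("2009-12-05", YEAR | MONTH | DAY),  # 7: full date
        ("2009-12-01", YEAR | MONTH),        # 6: day unknown
        ("2009-01-20", YEAR | DAY),          # 5: month unknown
        ("2009-01-01", YEAR),                # 4: year only
    ],
)

# Full dates within a period: one equality test on the mask plus a range.
full_in_december = conn.execute(
    "SELECT event_date FROM events "
    "WHERE mask = 7 AND event_date BETWEEN '2009-12-01' AND '2009-12-31'"
).fetchall()

# Rows whose month is known, regardless of the other parts.
month_known = conn.execute(
    "SELECT event_date FROM events WHERE (mask & ?) != 0 ORDER BY id",
    (MONTH,),
).fetchall()
```

Note that 2009-12-01 with mask 6 falls inside the December range but is excluded by the mask test, which is the distinction the mask column exists to make.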

+1

I was going to suggest the same solution that @ninesided offered above. Alternatively, you could have a date field plus a field that quantifies the uncertainty. This has the advantage that you can represent things like "around September 23rd, 2010". The problem is that to represent the case where you only know the year, you would have to set the date to the middle of the year with an uncertainty of 182.5 days (assuming it is not a leap year), which seems ugly.

You could also use a similar but distinct approach with a mask that represents which parts of the date you are sure of, which is what SQLMenace suggested in their answer above.

0
