There are several researchers who watch the remaining unicorns in Easter Island 1 . Each day, researchers record which unicorn they saw, the date of sight, the number of babies each unicorn has, and whether they were drunk when observed. They are individually uploaded to a central location, and then every day they give me a flat file of all new observations.
I have a table that looks like it contains information:
create table unicorn_observations ( observer_id number not null , unicorn_id number not null , created date not null
The table is separately unique to observer_id , unicorn_id and observation_date or lastseen .
Sometimes Cobold [sic] output control is a bit erroneous and re-outputs the same data twice. In this situation, I update lastseen instead of creating a new record. I only do this in situations where each column is the same
Unfortunately, researchers are not fully aware of the third normal form. Each month they upload observations of previous months for several unicorns, even if no new comments were made. They do this with the new observation_date , which means a new record is inserted into the table.
I have a separate created and lastseen for complete traceability, as researchers sometimes submit some observations late. They are created by the database and are not part of the information provided.
Here are some examples of data (with partially changed column names so that they match without a scrollbar).
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- +
| OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- +
| 1 | 1 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n |
| 1 | 2 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n |
| 1 | 3 | 01-NOV-11 | 01-NOV-11 | 31-OCT-11 | 10 | n |
| 1 | 6 | 10-NOV-11 | 10-NOV-11 | 07-NOV-11 | 0 | n |
| 1 | 1 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n |
| 1 | 2 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n |
| 1 | 3 | 17-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n |
| 1 | 6 | 17-NOV-11 | 17-NOV-11 | 17-NOV-11 | 0 | n |
| 1 | 6 | 01-DEC-11 | 01-DEC-11 | 01-DEC-11 | 0 | n |
| 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n |
| 1 | 6 | 01-FEB-12 | 01-FEB-12 | 01-FEB-12 | 0 | n |
| 1 | 6 | 01-MAR-12 | 01-MAR-12 | 01-MAR-12 | 0 | n |
| 1 | 6 | 01-APR-12 | 01-APR-12 | 01-APR-12 | 0 | n |
| 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y |
| 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y |
| 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y |
| 1 | 6 | 01-MAY-12 | 01-MAY-12 | 01-MAY-12 | 0 | n |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- +
I would like to partially denormalize these observations so that if a new record is received with the same observer_id , unicorn_id , no_of_babies and drunk (payload), but with the new observation_date update, I will have a new column in the table, last_observation_date , instead of inserting a new record . I would still lastseen in this situation.
I need to do this because I have a number of complex unicorn related questions that join this table; the researchers download old observations with new dates about 10 m once a month, and I get about 9 million really new records a month. I ran for a year and already have 225 m of unicorn sightings. Since I need to know the last observation date for each payload combination, I would prefer to significantly reduce the size of the table and save a lot of time scanning it completely.
This means that the table will be as follows:
create table unicorn_observations ( observer_id number not null , unicorn_id number not null , created date not null
and the data stored in the table will look like this: it does not matter whether last_observation_date zero or not if the observation was only "noticed" once. I don't need help loading the data, only partially denormalizing the current table to look like this.
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| 1 | 6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 | 0 | n | 01-DEC-11 |
| 1 | 1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 |
| 1 | 2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 |
| 1 | 3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 |
| 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | |
| 1 | 6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 | 0 | n | 01-MAY-12 |
| 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | |
| 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | |
| 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
Obvious answer
select observer_id as obs_id , unicorn_id as uni_id , min(created) as created , max(lastseen) as lastseen , min(observation_date) as obs_date , no_of_babies as "#BABIES" , drunk , max(observation_date) as last_obs_date from unicorn_observations group by observer_id , unicorn_id , no_of_babies , drunk
doesn't work because it ignores one observation of 3 unicorn children for unicorn 6 on January 1, 2011 in st ; this, in turn, means that the lastseen for a record created on November 10th is incorrect.
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| 1 | 1 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 |
| 1 | 2 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 |
| 1 | 3 | 01-NOV-11 | 17-NOV-11 | 09-APR-11 | 10 | n | 31-OCT-11 |
| 1 | 6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 | 0 | n | 01-MAY-12 |
| 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | 01-JAN-12 |
| 1 | 1 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 |
| 1 | 2 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 |
| 1 | 3 | 19-APR-12 | 19-APR-12 | 19-APR-12 | 7 | y | 19-APR-12 |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
Currently, I see no way to do this without any procedural logic, i.e. cycle. I would prefer to avoid a loop in this situation, since I would have to completely scan the row table 225 m 260 times (number of great dates created ). Even the use of lag() and lead() must be recursive, since there is an indefinite number of observations per one unicorn.
Is there a way to create this dataset in a single SQL statement?
Table specifications and sample data are also in the SQL Fiddle .
Trying a better explanation:
The problem is that when something was true. On January 1, 2012, Unicorn 6 had 3 children.
A look at unicorn 6 in the “table” created by GROUP BY; if I try to find the number of children at 1 st in January, I will get two entries that are contradictory.
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| 1 | 6 | 10-NOV-11 | 01-MAY-12 | 07-NOV-11 | 0 | n | 01-MAY-12 |
| 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | 01-JAN-12 |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
However, I only need one row, as in the second table. Here for any moment in time there is no more than one “correct” value, because two time periods, when one unicorn 6 had 0 babies, were divided into two lines by the day when he had 3.
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| OBS_ID | UNI_ID | CREATED | LASTSEEN | OBS_DATE | #BABIES | DRUNK | LAST_OBS_DT |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
| 1 | 6 | 10-NOV-11 | 01-DEC-11 | 07-NOV-11 | 0 | n | 01-DEC-11 |
| 1 | 6 | 01-JAN-12 | 01-JAN-12 | 01-JAN-12 | 3 | n | |
| 1 | 6 | 01-FEB-12 | 01-MAY-12 | 01-FEB-12 | 0 | n | 01-MAY-12 |
+ -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
<sub> 1. graze around moai