Partially denormalizing unicorn sightings

Question

Partially denormalizing unicorn sightings

There are several researchers who watch the remaining unicorns in Easter Island ¹ . Each day, researchers record which unicorn they saw, the date of sight, the number of babies each unicorn has, and whether they were drunk when observed. They are individually uploaded to a central location, and then every day they give me a flat file of all new observations.

I have a table that looks like it contains information:

create table unicorn_observations ( observer_id number not null , unicorn_id number not null , created date not null -- date the record was inserted into the database , lastseen date not null -- date the record was last seen , observation_date date not null , no_of_babies number not null , drunk varchar2(1) not null , constraint pk_uo primary key ( observer_id, unicorn_id, created ) , constraint chk_uo_babies check ( no_of_babies >= 0 ) , constraint chk_uo_drunk check ( drunk in ('y','n') ) );

The table is separately unique to observer_id , unicorn_id and observation_date or lastseen .

Sometimes Cobold [sic] output control is a bit erroneous and re-outputs the same data twice. In this situation, I update lastseen instead of creating a new record. I only do this in situations where each column is the same

Unfortunately, researchers are not fully aware of the third normal form. Each month they upload observations of previous months for several unicorns, even if no new comments were made. They do this with the new observation_date , which means a new record is inserted into the table.

I have a separate created and lastseen for complete traceability, as researchers sometimes submit some observations late. They are created by the database and are not part of the information provided.

Here are some examples of data (with partially changed column names so that they match without a scrollbar).

  + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- +
 |  OBS_ID |  UNI_ID |  CREATED |  LASTSEEN |  OBS_DATE |  #BABIES |  DRUNK |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- +
 |  1 |  1 |  01-NOV-11 |  01-NOV-11 |  31-OCT-11 |  10 |  n |
 |  1 |  2 |  01-NOV-11 |  01-NOV-11 |  31-OCT-11 |  10 |  n |
 |  1 |  3 |  01-NOV-11 |  01-NOV-11 |  31-OCT-11 |  10 |  n |
 |  1 |  6 |  10-NOV-11 |  10-NOV-11 |  07-NOV-11 |  0 |  n |
 |  1 |  1 |  17-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |
 |  1 |  2 |  17-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |
 |  1 |  3 |  17-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |
 |  1 |  6 |  17-NOV-11 |  17-NOV-11 |  17-NOV-11 |  0 |  n |
 |  1 |  6 |  01-DEC-11 |  01-DEC-11 |  01-DEC-11 |  0 |  n |
 |  1 |  6 |  01-JAN-12 |  01-JAN-12 |  01-JAN-12 |  3 |  n |
 |  1 |  6 |  01-FEB-12 |  01-FEB-12 |  01-FEB-12 |  0 |  n |
 |  1 |  6 |  01-MAR-12 |  01-MAR-12 |  01-MAR-12 |  0 |  n |
 |  1 |  6 |  01-APR-12 |  01-APR-12 |  01-APR-12 |  0 |  n |
 |  1 |  1 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |
 |  1 |  2 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |
 |  1 |  3 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |
 |  1 |  6 |  01-MAY-12 |  01-MAY-12 |  01-MAY-12 |  0 |  n |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- +

I would like to partially denormalize these observations so that if a new record is received with the same observer_id , unicorn_id , no_of_babies and drunk (payload), but with the new observation_date update, I will have a new column in the table, last_observation_date , instead of inserting a new record . I would still lastseen in this situation.

I need to do this because I have a number of complex unicorn related questions that join this table; the researchers download old observations with new dates about 10 m once a month, and I get about 9 million really new records a month. I ran for a year and already have 225 m of unicorn sightings. Since I need to know the last observation date for each payload combination, I would prefer to significantly reduce the size of the table and save a lot of time scanning it completely.

This means that the table will be as follows:

 create table unicorn_observations ( observer_id number not null , unicorn_id number not null , created date not null -- date the record was inserted into the database , lastseen date not null -- date the record was last seen , observation_date date not null , no_of_babies number not null , drunk varchar2(1) not null , last_observation_date date , constraint pk_uo primary key ( observer_id, unicorn_id, created ) , constraint chk_uo_babies check ( no_of_babies >= 0 ) , constraint chk_uo_drunk check ( drunk in ('y','n') ) );

and the data stored in the table will look like this: it does not matter whether last_observation_date zero or not if the observation was only "noticed" once. I don't need help loading the data, only partially denormalizing the current table to look like this.

  + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  OBS_ID |  UNI_ID |  CREATED |  LASTSEEN |  OBS_DATE |  #BABIES |  DRUNK |  LAST_OBS_DT |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  1 |  6 |  10-NOV-11 |  01-DEC-11 |  07-NOV-11 |  0 |  n |  01-DEC-11 |
 |  1 |  1 |  01-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |  31-OCT-11 |
 |  1 |  2 |  01-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |  31-OCT-11 |
 |  1 |  3 |  01-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |  31-OCT-11 |
 |  1 |  6 |  01-JAN-12 |  01-JAN-12 |  01-JAN-12 |  3 |  n |  |
 |  1 |  6 |  01-FEB-12 |  01-MAY-12 |  01-FEB-12 |  0 |  n |  01-MAY-12 |
 |  1 |  1 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |  |
 |  1 |  2 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |  |
 |  1 |  3 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |  |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +

Obvious answer

 select observer_id as obs_id , unicorn_id as uni_id , min(created) as created , max(lastseen) as lastseen , min(observation_date) as obs_date , no_of_babies as "#BABIES" , drunk , max(observation_date) as last_obs_date from unicorn_observations group by observer_id , unicorn_id , no_of_babies , drunk

doesn't work because it ignores one observation of 3 unicorn children for unicorn 6 on January 1, 2011 in ^st ; this, in turn, means that the lastseen for a record created on November 10th is incorrect.

  + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  OBS_ID |  UNI_ID |  CREATED |  LASTSEEN |  OBS_DATE |  #BABIES |  DRUNK |  LAST_OBS_DT |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  1 |  1 |  01-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |  31-OCT-11 |
 |  1 |  2 |  01-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |  31-OCT-11 |
 |  1 |  3 |  01-NOV-11 |  17-NOV-11 |  09-APR-11 |  10 |  n |  31-OCT-11 |
 |  1 |  6 |  10-NOV-11 |  01-MAY-12 |  07-NOV-11 |  0 |  n |  01-MAY-12 |
 |  1 |  6 |  01-JAN-12 |  01-JAN-12 |  01-JAN-12 |  3 |  n |  01-JAN-12 |
 |  1 |  1 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |  19-APR-12 |
 |  1 |  2 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |  19-APR-12 |
 |  1 |  3 |  19-APR-12 |  19-APR-12 |  19-APR-12 |  7 |  y |  19-APR-12 |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +

Currently, I see no way to do this without any procedural logic, i.e. cycle. I would prefer to avoid a loop in this situation, since I would have to completely scan the row table 225 m 260 times (number of great dates created ). Even the use of lag() and lead() must be recursive, since there is an indefinite number of observations per one unicorn.

Is there a way to create this dataset in a single SQL statement?

Table specifications and sample data are also in the SQL Fiddle .

Trying a better explanation:

The problem is that when something was true. On January 1, 2012, Unicorn 6 had 3 children.

A look at unicorn 6 in the “table” created by GROUP BY; if I try to find the number of children at 1 ^st in January, I will get two entries that are contradictory.

  + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  OBS_ID |  UNI_ID |  CREATED |  LASTSEEN |  OBS_DATE |  #BABIES |  DRUNK |  LAST_OBS_DT |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  1 |  6 |  10-NOV-11 |  01-MAY-12 |  07-NOV-11 |  0 |  n |  01-MAY-12 |
 |  1 |  6 |  01-JAN-12 |  01-JAN-12 |  01-JAN-12 |  3 |  n |  01-JAN-12 |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +

However, I only need one row, as in the second table. Here for any moment in time there is no more than one “correct” value, because two time periods, when one unicorn 6 had 0 babies, were divided into two lines by the day when he had 3.

  + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  OBS_ID |  UNI_ID |  CREATED |  LASTSEEN |  OBS_DATE |  #BABIES |  DRUNK |  LAST_OBS_DT |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +
 |  1 |  6 |  10-NOV-11 |  01-DEC-11 |  07-NOV-11 |  0 |  n |  01-DEC-11 |
 |  1 |  6 |  01-JAN-12 |  01-JAN-12 |  01-JAN-12 |  3 |  n |  |
 |  1 |  6 |  01-FEB-12 |  01-MAY-12 |  01-FEB-12 |  0 |  n |  01-MAY-12 |
 + -------- + -------- + ----------- + ----------- + ------- ---- + --------- + ------- + ------------- +

<sub> 1. graze around moai

+7

sql oracle gaps-and-islands

Ben Oct 4 '12 at 9:01

source share

3 answers

Try it.

 with cte as ( select v.*, ROW_NUMBER() over (partition by grp, unicorn_id order by grp, unicorn_id) rn from ( select u.*, ROW_NUMBER() over (partition by unicorn_id order by no_of_babies, drunk, created ) -ROW_NUMBER() over (partition by unicorn_id order by created) as grp from unicorn_observations u ) v ) select observer_id, cte.unicorn_id, mincreated,maxlastseen,minobsdate,no_of_babies,drunk,maxobsdate from cte inner join ( select unicorn_id, grp, min(created) as mincreated, max(lastseen) as maxlastseen, min(observation_date) as minobsdate, max(observation_date) as maxobsdate from cte group by unicorn_id, grp ) v on cte.grp = v.grp and cte.unicorn_id = v.unicorn_id where rn=1 order by created;

+2

podiluska Oct 4 '12 at 10:31

source share

This type of problem can be solved by first creating several flags in the subquery, then using them.

 with obs_flags as ( select observer_id as obs_id , unicorn_id as uni_id , case when lag(observation_date) over ( partition by unicorn_id, no_of_babies, drunk order by unicorn_id, observation_date ) is null then 1 else 0 end as group_start , case when lead(observation_date) over ( partition by unicorn_id, no_of_babies,drunk order by unicorn_id, observation_date ) is null then 1 else 0 end as group_end , observation_date , no_of_babies , drunk , lastseen , created from unicorn_observations ) select obs_start.obs_id , obs_start.uni_id , obs_start.created , obs_end.lastseen as lastseen , obs_start.observation_date , obs_start.no_of_babies as "#BABIES" , obs_start.drunk , obs_end.observation_date as last_obs_date from obs_flags obs_start join obs_flags obs_end on obs_start.group_start = 1 and obs_end.group_end = 1 and obs_start.uni_id = obs_end.uni_id and obs_start.no_of_babies = obs_end.no_of_babies and obs_start.drunk = obs_end.drunk and obs_start.observation_date <= obs_end.observation_date and --Only join with the first end point we find: not exists ( select * from obs_flags f where obs_start.uni_id = f.uni_id and obs_start.no_of_babies = f.no_of_babies and obs_start.drunk = f.drunk and f.group_end = 1 and f.observation_date < obs_end.observation_date and f.observation_date >= obs_start.observation_date );

This is a difficult problem; Perhaps I did not completely fulfill your requirements (or there may be a typo. I do not have Oracle for testing). However, this should give you an idea of how this can be done.

Basically, you first find all the start and end records of the periods you are interested in. Then you attach each start entry to the next end entry within the same grouping.

Update: my original code did not verify that the end came after launch. I fixed it.

Update2: as Ben pointed out, the sentence not exists will be slow. The alternative that helped me speed up my work in the past is to do this in two steps: first find all the possible pairs, and then separately select only the correct pairs from this.

In this case, attach each obs_start to each potentially correct obs_end in a temporary table or subquery.

Then from these mates, select the one that has the earliest obs_end for each obs_start .

0

dan1111 Oct 4 '12 at 9:46

source share

Alex poole · Accepted Answer · 2012-10-04T22:53:14+0000

Based on what I think you are trying to do, mainly based on your update regarding specific problems with Unicorn 6, I think this gives the result you want. It does not need recursive lead and lag , but requires two levels.

 select * from ( select observer_id, unicorn_id, case when first_obs_dt is null then created else lag(created) over (order by rn) end as created, case when last_obs_dt is null then lastseen else lead(lastseen) over (order by rn) end as lastseen, case when first_obs_dt is null then observation_date else lag(observation_date) over (order by rn) end as observation_date, no_of_babies, drunk, case when last_obs_dt is null then observation_date else null end as last_obs_dt from ( select observer_id, unicorn_id, created, lastseen, observation_date, no_of_babies, drunk, case when lag_no_babies != no_of_babies or lag_drunk != drunk or lag_obs_dt is null then null else lag_obs_dt end as first_obs_dt, case when lead_no_babies != no_of_babies or lead_drunk != drunk or lead_obs_dt is null then null else lead_obs_dt end as last_obs_dt, rownum rn from ( select observer_id, unicorn_id, created, lastseen, observation_date, no_of_babies, drunk, lag(observation_date) over (partition by observer_id, unicorn_id, no_of_babies, drunk order by observation_date) lag_obs_dt, lag(no_of_babies) over (partition by observer_id, unicorn_id, drunk order by observation_date) lag_no_babies, lag(drunk) over (partition by observer_id, unicorn_id, no_of_babies order by observation_date) lag_drunk, lead(observation_date) over (partition by observer_id, unicorn_id, no_of_babies, drunk order by observation_date) lead_obs_dt, lead(no_of_babies) over (partition by observer_id, unicorn_id, drunk order by observation_date) lead_no_babies, lead(drunk) over (partition by observer_id, unicorn_id, no_of_babies order by observation_date) lead_drunk from unicorn_observations order by 1,2,5 ) ) where first_obs_dt is null or last_obs_dt is null ) where last_obs_dt is not null order by 1,2,3,4;

What gives:

 OBSERVER_ID UNICORN_ID CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAST_OBS_ ----------- ---------- --------- --------- --------- ------------ - --------- 1 1 17-NOV-11 01-NOV-11 09-APR-11 10 n 31-OCT-11 1 1 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12 1 2 17-NOV-11 01-NOV-11 09-APR-11 10 n 31-OCT-11 1 2 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12 1 3 17-NOV-11 01-NOV-11 09-APR-11 10 n 31-OCT-11 1 3 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12 1 6 10-NOV-11 01-DEC-11 07-NOV-11 0 n 01-DEC-11 1 6 01-JAN-12 01-JAN-12 01-JAN-12 3 n 01-JAN-12 1 6 01-FEB-12 01-MAY-12 01-FEB-12 0 n 01-MAY-12 9 rows selected.

He got three entries for unicorn 6, but lastseen and observation_date for the third is the return path to your sample, so I'm not sure I still don't understand this. I suggested that you want to keep the earliest observation_date and last lastseen in each group, assuming that this seems to happen when new entries are added, but I'm not sure ...

Thus, the innermost query gets the raw data from the table and gets the lead and lag columns for the observation_date and no_of_babies and drunk columns, using slightly different partitions. order by so rownum can be used later, obtained in the next step and used to order in one after that. Just for unicorn 6 for brevity:

 CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L --------- --------- --------- ------------ - --------- ------------- - --------- -------------- - 10-NOV-11 10-NOV-11 07-NOV-11 0 n 17-NOV-11 0 n 17-NOV-11 17-NOV-11 17-NOV-11 0 n 07-NOV-11 0 n 01-DEC-11 0 n 01-DEC-11 01-DEC-11 01-DEC-11 0 n 17-NOV-11 0 n 01-FEB-12 3 n 01-JAN-12 01-JAN-12 01-JAN-12 3 n 0 0 01-FEB-12 01-FEB-12 01-FEB-12 0 n 01-DEC-11 3 n 01-MAR-12 0 n 01-MAR-12 01-MAR-12 01-MAR-12 0 n 01-FEB-12 0 n 01-APR-12 0 n 01-APR-12 01-APR-12 01-APR-12 0 n 01-MAR-12 0 n 01-MAY-12 0 n 01-MAY-12 01-MAY-12 01-MAY-12 0 n 01-APR-12 0 n

The next level obscures the lead and lag values for observation_date , if the num_of_babies or drunk value was changed - you only specifically referred to splitting on the child’s account, but I assume that you want to split into sobriety too. After that, all that is null for first_obs_date or last_obs_date is the beginning or end of the mini range.

 CREATED LASTSEEN OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_ RN --------- --------- --------- ------------ - --------- --------- ---------- 10-NOV-11 10-NOV-11 07-NOV-11 0 n 17-NOV-11 1 17-NOV-11 17-NOV-11 17-NOV-11 0 n 07-NOV-11 01-DEC-11 2 01-DEC-11 01-DEC-11 01-DEC-11 0 n 17-NOV-11 3 01-JAN-12 01-JAN-12 01-JAN-12 3 n 4 01-FEB-12 01-FEB-12 01-FEB-12 0 n 01-MAR-12 5 01-MAR-12 01-MAR-12 01-MAR-12 0 n 01-FEB-12 01-APR-12 6 01-APR-12 01-APR-12 01-APR-12 0 n 01-MAR-12 01-MAY-12 7 01-MAY-12 01-MAY-12 01-MAY-12 0 n 01-APR-12 8

Everything that is not the beginning or end of the mini-range can now be ignored, since the values either coincide or are replaced by those that were before or after. This applies to an indefinite number of observational problems — no matter how much you ignore at this stage. Thus, the next level eliminates these intermediate values by filtering the lines where both first_obs_dt and last_obs_dt not equal to zero. Inside this filtered set there is a second level of lead and lag to get the first or last value for each date - and that bit, which I'm not sure, is right, because it does not match one of your samples.

 CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAST_OBS_ --------- --------- --------- ------------ - --------- 10-NOV-11 01-DEC-11 07-NOV-11 0 n 10-NOV-11 01-DEC-11 07-NOV-11 0 n 01-DEC-11 01-JAN-12 01-JAN-12 01-JAN-12 3 n 01-JAN-12 01-FEB-12 01-MAY-12 01-FEB-12 0 n 01-FEB-12 01-MAY-12 01-FEB-12 0 n 01-MAY-12

Finally, the remaining lines that do not have last_obs_dt are filtered out.

Now I will wait to find out which bits I misunderstood ... * 8-)

After correcting the order of lead and lag , the same information for each stage for unicorn 1:

 CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAG_OBS_D LAG_NO_BABIES L LEAD_OBS_ LEAD_NO_BABIES L --------- --------- --------- ------------ - --------- ------------- - --------- -------------- - 17-NOV-11 17-NOV-11 09-APR-11 10 n 31-OCT-11 10 n 01-NOV-11 01-NOV-11 31-OCT-11 10 n 09-APR-11 10 n 19-APR-12 19-APR-12 19-APR-12 7 y CREATED LASTSEEN OBSERVATI NO_OF_BABIES D FIRST_OBS LAST_OBS_ RN --------- --------- --------- ------------ - --------- --------- ---------- 17-NOV-11 17-NOV-11 09-APR-11 10 n 31-OCT-11 1 01-NOV-11 01-NOV-11 31-OCT-11 10 n 09-APR-11 2 19-APR-12 19-APR-12 19-APR-12 7 y 3 CREATED LASTSEEN OBSERVATI NO_OF_BABIES D LAST_OBS_ --------- --------- --------- ------------ - --------- 17-NOV-11 17-NOV-11 09-APR-11 10 n 09-APR-11 19-APR-12 19-APR-12 19-APR-12 7 y 19-APR-12

I am not sure what will happen to the saved observation_date and lastseen when the original data was entered outside a sequence like this, or what you will do in this situation with new entries added in the future.

Partially denormalizing unicorn sightings

More articles: