SQL to find the first occurrence of datasets in a table

Question

Say I have a table:

CREATE TABLE T ( TableDTM TIMESTAMP NOT NULL, Code INT NOT NULL );

And I insert a few rows:

    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:00:00', 5);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:10:00', 5);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:20:00', 5);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:30:00', 5);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:40:00', 0);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 10:50:00', 1);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:00:00', 1);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:10:00', 1);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:20:00', 0);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:30:00', 5);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:40:00', 5);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 11:50:00', 3);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:00:00', 3);
    INSERT INTO T (TableDTM, Code) VALUES ('2011-01-13 12:10:00', 3);

So I end up with a table like this:

    2011-01-13 10:00:00, 5
    2011-01-13 10:10:00, 5
    2011-01-13 10:20:00, 5
    2011-01-13 10:30:00, 5
    2011-01-13 10:40:00, 0
    2011-01-13 10:50:00, 1
    2011-01-13 11:00:00, 1
    2011-01-13 11:10:00, 1
    2011-01-13 11:20:00, 0
    2011-01-13 11:30:00, 5
    2011-01-13 11:40:00, 5
    2011-01-13 11:50:00, 3
    2011-01-13 12:00:00, 3
    2011-01-13 12:10:00, 3

How can I select the first date of each run of identical codes, so that I get the following:

    2011-01-13 10:00:00, 5
    2011-01-13 10:40:00, 0
    2011-01-13 10:50:00, 1
    2011-01-13 11:20:00, 0
    2011-01-13 11:30:00, 5
    2011-01-13 11:50:00, 3

I have been messing about with subqueries and the like for most of the day, and for some reason I can't crack it. I am sure there is an easy way!

I would probably want to exclude 0 from the results, but that is not important right now.

+6

sql database

Mark Jan 13 '11 at 16:03

4 answers

PostgreSQL supports window functions; see its documentation on them.

[EDIT] Try the following:

    SELECT TableDTM, Code
    FROM (
        SELECT TableDTM,
               Code,
               LAG(Code, 1, NULL) OVER (ORDER BY TableDTM) AS PrevCode
        FROM T
    ) AS X  -- derived-table alias, which PostgreSQL has traditionally required
    WHERE PrevCode <> Code
       OR PrevCode IS NULL;

+1

vc 74 Jan 13 '11 at 16:13

Try the following:

    SELECT MIN(TableDTM) TableDTM, Code
    FROM (
        SELECT T1.TableDTM, T1.Code, MIN(T2.TableDTM) XTableDTM
        FROM T T1
        LEFT JOIN T T2
               ON T1.TableDTM <= T2.TableDTM
              AND T1.Code <> T2.Code
        GROUP BY T1.TableDTM, T1.Code
    ) X
    GROUP BY XTableDTM, Code
    ORDER BY 1;

+1

sqlvogel Jan 13 '11 at 17:01

You could try something like:

    SELECT DISTINCT Code,
           (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) AS TableDTM
    FROM T;

and if you need to exclude 0, change it to:

    SELECT DISTINCT Code,
           (SELECT MIN(TableDTM) FROM T AS Q WHERE Q.Code = T.Code) AS TableDTM
    FROM T
    WHERE Code <> 0;

0

Ass3mbler Jan 13 '11 at 16:10

Performancedba · Accepted Answer · 2011-01-13T23:34:00+0000

Edited Jan 15 '11

"I am sure there is an easy way!"

Yes, there is. But first, two issues.

The table is not a relational database table. It does not have a unique key, which is demanded by the Relational Model and by Normalisation (specifically, each row must have a unique identifier, not necessarily a PK). Therefore SQL, the standard language for operating on relational database tables, cannot perform basic operations on it.
- it is a heap (a data structure in which records are inserted and deleted in chronological order), with records, not rows
- any and all operations using SQL will be horribly slow and will not be correct
- set ROWCOUNT to 1, perform row-by-row processing, and SQL will work on the heap just fine
- your best bet is to use any unix utility to operate on it (awk, cut, chop). They are blindingly fast. The awk script required to answer your requirement would take 3 minutes to write, and it will run in seconds for millions of records (I wrote a few last week).
So the question really is SQL to find the first occurrence of sets of data in a non-relational heap.
Now, if your question were SQL to find the first occurrence of sets of data in a relational table, implying of course a unique row identifier, that would be (a) easy in SQL and (b) fast in every flavour of SQL ...
- except Oracle, which is known to handle subqueries poorly (see, in particular, Tony Andrews' comments; he is a well-known authority on Oracle). In that case, use Materialised Views.
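
For the narrow question as asked, here is a minimal sketch of how a correlated subquery can do it. It is not taken from the linked material: it assumes TableDTM is declared as the primary key (the unique identifier the original table lacks), and it runs on SQLite through Python's sqlite3 module only so that the example is self-contained. The aliases cur, prev and p are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: TableDTM is the primary key, so every row is uniquely identified.
conn.execute(
    "CREATE TABLE T (TableDTM TIMESTAMP NOT NULL PRIMARY KEY, Code INT NOT NULL)"
)

# The question's sample rows: one row every 10 minutes, starting at 10:00.
codes = [5, 5, 5, 5, 0, 1, 1, 1, 0, 5, 5, 3, 3, 3]
rows = []
for i, code in enumerate(codes):
    hour, minute = divmod(600 + 10 * i, 60)
    rows.append((f"2011-01-13 {hour:02d}:{minute:02d}:00", code))
conn.executemany("INSERT INTO T (TableDTM, Code) VALUES (?, ?)", rows)

# Keep a row only when the immediately preceding row (by TableDTM) does not
# carry the same Code. The first row has no predecessor, so it is kept.
first_of_each_run = conn.execute(
    """
    SELECT cur.TableDTM, cur.Code
    FROM   T AS cur
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   T AS prev
               WHERE  prev.TableDTM = (SELECT MAX(p.TableDTM)
                                       FROM   T AS p
                                       WHERE  p.TableDTM < cur.TableDTM)
               AND    prev.Code = cur.Code
           )
    ORDER  BY cur.TableDTM
    """
).fetchall()

for row in first_of_each_run:
    print(row)
```

To leave out the 0 rows, adding AND cur.Code <> 0 to the outer WHERE should be enough, since the predecessor check still looks at every row.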
The question is very generic (no complaint). But many such specific needs are usually applied within a larger context, and that context has requirements which are absent from the specification here. Generally the requirement is for a simple subquery (but in Oracle, use a Materialised View to avoid the subquery). And the subquery, too, depends on the outer context, the outer query. Therefore the answer to the small generic question will not contain the answer to the actual specific need.

Anyway, I do not wish to avoid the question. Why don't we use a real-world example rather than a simple generic one, and find the first or last occurrence, or the minimum or maximum value, of a set of data within another set of data, in a relational table?

Main Query

Let me use the ▶ Data Model ◀ from your previous question.

Report all Alerts since a certain date, with the peak Value for the duration, that are not Acknowledged

Since you will be using exactly the same technique (with different table and column names) for all your temporal and history requirements, you need to fully understand the basic subquery construct and its different applications.

Introduction

Note that you have not only a clean 5NF database with relational identifiers (composite keys), but also full temporal capability, and the temporal requirement is met without breaking 5NF (no update anomalies), which means the ValidToDateTime for periods and durations is derived, not duplicated in data. The point is that this complicates things, so it is not the best example for a tutorial on subqueries.

Remember that the SQL engine is a set processor, so we approach the problem with a set-oriented mindset.
- do not dumb the engine down to row processing; that is very slow
- and, more important, unnecessary
Subqueries are normal SQL. The syntax I am using is straight ISO/IEC/ANSI SQL.
- if you cannot code subqueries in SQL, you will be very limited; you will then need to introduce data duplication, or use large result sets as materialised views or temporary tables, or all manner of additional data and additional processing, which will be slow to very slow, not to mention completely unnecessary
- if there is ever anything you cannot do in a truly relational database (and my data models always are) without switching to row processing, inline views or temp tables, ask for help, which is what you have done here.
You need to fully understand the first subquery (the simplest) before attempting to understand the second, and so on.

Method

First, build the outer query using minimal joins, etc., based on the structure of the result set that you need, and nothing more. It is very important to resolve the structure of the outer query first; otherwise you will go back and forth trying to make the subquery fit the outer query, and vice versa.

That happens to require a subquery as well. So leave that part out for now and pick it up later. For now, the outer query gets all (not just un-acknowledged) Alerts after a certain date.

The ▶ SQL code ◀ required is on page 1 (sorry, the SO edit features are horrible; they destroy the formatting, and the code is already formatted).

Then build the subquery to fill each cell.

Subquery (1) Derive Alert.Value

This is a simple derived datum: select the Value from the Reading that generated the Alert. The tables are related, the cardinality is 1::1, so it is a straight join on the PK.

The type of subquery required here is a Correlated Subquery: we need to correlate a table in the outer query to a table in the (inner) subquery.
- to do that, we need an alias for the table in the outer query, to correlate it to a table in the subquery
- to make the distinction, I have used aliases only for such required correlation, and fully qualified names for plain joins
Subqueries are very fast in any engine (except Oracle).
SQL is a cumbersome language. But it is all we have. So get used to it.
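
As a rough illustration of that correlated scalar subquery (not the code from the linked pages), here is a sketch on a cut-down, hypothetical version of the tables: a single SensorId stands in for the real composite key, and the rows are invented. It uses SQLite via Python's sqlite3 so it can be run as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables; the real Data Model has wider composite keys.
conn.executescript(
    """
    CREATE TABLE Reading (
        SensorId   INT  NOT NULL,
        ReadingDtm TEXT NOT NULL,
        Value      INT  NOT NULL,
        PRIMARY KEY (SensorId, ReadingDtm)
    );
    CREATE TABLE Alert (
        SensorId   INT  NOT NULL,
        ReadingDtm TEXT NOT NULL,
        PRIMARY KEY (SensorId, ReadingDtm)
    );
    INSERT INTO Reading VALUES (1, '2011-01-13 10:00:00',  90);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:10:00', 120);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:50:00', 110);
    INSERT INTO Alert   VALUES (1, '2011-01-13 10:10:00');
    INSERT INTO Alert   VALUES (1, '2011-01-13 10:50:00');
    """
)

# Outer query: all Alerts since a date. The alias "a" exists only so the
# subquery can correlate to the outer row; the subquery fills one cell (Value).
alerts = conn.execute(
    """
    SELECT a.SensorId,
           a.ReadingDtm,
           (SELECT r.Value
            FROM   Reading AS r
            WHERE  r.SensorId   = a.SensorId
            AND    r.ReadingDtm = a.ReadingDtm) AS Value
    FROM   Alert AS a
    WHERE  a.ReadingDtm >= '2011-01-13 00:00:00'
    ORDER  BY a.ReadingDtm
    """
).fetchall()

print(alerts)
```

Because the relationship is 1::1 on the key, the subquery can return at most one value per Alert row, which is what makes it usable in the SELECT list.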

The ▶ SQL code ◀ required is on page 2.

I have purposely given you a mix of joins in the outer query and data obtained via subquery, so that you can learn (you could alternatively obtain Alert.Value via a join, but that would be even more cumbersome).

The next subquery we need derives Alert.PeakValue. For that we need to determine the temporal duration of the Alert. We have the beginning of the Alert duration; we need to determine the end of the duration, which is the next (temporally) Reading.Value that is within range. That requires a subquery as well, which we had better handle first.

Work the logic from the inside out. Good old BODMAS.

Subquery (2) Derive Alert.EndDtm

A slightly more complex subquery: select the first Reading.ReadingDtm that is greater than or equal to the Alert.ReadingDtm and that has a Reading.Value less than or equal to its Sensor.UpperLimit.

Handling 5NF Temporal Data

To handle temporal requirements in a 5NF database (in which EndDateTime is not stored, as that would be duplicate data), we work with a StartDateTime only, and the EndDateTime is derived: it is the next StartDateTime. This is the temporal notion of Duration.

Technically, it is one millisecond (or whatever the resolution of the datatype in use) less.
However, to be reasonable, we can speak of, and report, EndDateTime as simply the next StartDateTime, and ignore the one-millisecond issue.
The code should always use >= This.StartDateTime and < Next.StartDateTime.
- That eliminates a slew of avoidable bugs.
- Note that these comparison operators, which bracket the temporal duration and should be used in the conventional manner throughout, as above, are quite independent of similar comparison operators related to business logic, e.g. Sensor.UpperLimit (i.e. watch for it, because both are often located in one WHERE clause, and they are easy to mix up or confuse).
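
To make the derived EndDtm concrete, here is a minimal sketch, again on hypothetical cut-down tables (a single SensorId as the key, invented readings) and on SQLite via Python's sqlite3 rather than the engine used for the linked code. The end of each Alert is taken as the first Reading at or after the Alert whose Value is back within the Sensor's UpperLimit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables and data for illustration only.
conn.executescript(
    """
    CREATE TABLE Sensor  (SensorId INT PRIMARY KEY, UpperLimit INT NOT NULL);
    CREATE TABLE Reading (SensorId INT NOT NULL, ReadingDtm TEXT NOT NULL,
                          Value INT NOT NULL,
                          PRIMARY KEY (SensorId, ReadingDtm));
    CREATE TABLE Alert   (SensorId INT NOT NULL, ReadingDtm TEXT NOT NULL,
                          PRIMARY KEY (SensorId, ReadingDtm));
    INSERT INTO Sensor  VALUES (1, 100);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:00:00',  90);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:10:00', 120);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:20:00', 150);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:30:00', 130);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:40:00',  95);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:50:00', 110);
    INSERT INTO Reading VALUES (1, '2011-01-13 11:00:00',  80);
    INSERT INTO Alert   VALUES (1, '2011-01-13 10:10:00');
    INSERT INTO Alert   VALUES (1, '2011-01-13 10:50:00');
    """
)

durations = conn.execute(
    """
    SELECT a.SensorId,
           a.ReadingDtm,
           (SELECT MIN(r.ReadingDtm)
            FROM   Reading AS r
            JOIN   Sensor  AS s ON s.SensorId = r.SensorId
            WHERE  r.SensorId    = a.SensorId
            AND    r.ReadingDtm >= a.ReadingDtm   -- temporal bracket
            AND    r.Value      <= s.UpperLimit   -- business rule
           ) AS EndDtm
    FROM   Alert AS a
    ORDER  BY a.ReadingDtm
    """
).fetchall()

for row in durations:
    print(row)
```

The two comparisons in the WHERE clause are commented separately on purpose: one brackets time, the other applies the limit, and they should not be confused.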

The ▶ SQL code ◀ required, along with the test data used, is on page 3.

Subquery (3) Derive Alert.PeakValue

Now it is easy. Select the MAX(Value) from the Readings between Alert.ReadingDtm and Alert.EndDtm, the duration of the Alert.
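
A sketch of how that might look, under the same assumptions as before: hypothetical cut-down tables keyed on a single SensorId, invented readings, and SQLite via Python's sqlite3 for runnability. The duration is treated as a half-open interval, from the Alert's own reading up to but not including the derived end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables and data for illustration only.
conn.executescript(
    """
    CREATE TABLE Sensor  (SensorId INT PRIMARY KEY, UpperLimit INT NOT NULL);
    CREATE TABLE Reading (SensorId INT NOT NULL, ReadingDtm TEXT NOT NULL,
                          Value INT NOT NULL,
                          PRIMARY KEY (SensorId, ReadingDtm));
    CREATE TABLE Alert   (SensorId INT NOT NULL, ReadingDtm TEXT NOT NULL,
                          PRIMARY KEY (SensorId, ReadingDtm));
    INSERT INTO Sensor  VALUES (1, 100);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:00:00',  90);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:10:00', 120);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:20:00', 150);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:30:00', 130);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:40:00',  95);
    INSERT INTO Reading VALUES (1, '2011-01-13 10:50:00', 110);
    INSERT INTO Reading VALUES (1, '2011-01-13 11:00:00',  80);
    INSERT INTO Alert   VALUES (1, '2011-01-13 10:10:00');
    INSERT INTO Alert   VALUES (1, '2011-01-13 10:50:00');
    """
)

peaks = conn.execute(
    """
    SELECT a.SensorId,
           a.ReadingDtm,
           (SELECT MAX(r.Value)
            FROM   Reading AS r
            WHERE  r.SensorId    = a.SensorId
            AND    r.ReadingDtm >= a.ReadingDtm          -- start of duration
            AND    r.ReadingDtm <  (                     -- derived end
                       SELECT MIN(e.ReadingDtm)
                       FROM   Reading AS e
                       JOIN   Sensor  AS s ON s.SensorId = e.SensorId
                       WHERE  e.SensorId    = a.SensorId
                       AND    e.ReadingDtm >= a.ReadingDtm
                       AND    e.Value      <= s.UpperLimit)
           ) AS PeakValue
    FROM   Alert AS a
    ORDER  BY a.ReadingDtm
    """
).fetchall()

for row in peaks:
    print(row)
```

In this sketch an Alert that has not ended yet would get a NULL PeakValue, because the comparison against a NULL end matches no rows; whether that is the desired behaviour is a design decision the sketch does not settle.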

The ▶ SQL code ◀ required is on page 4.

Scalar Subquery

In addition to being Correlated Subqueries, the above are all Scalar Subqueries, as they return a single value; each cell in the grid can be filled with only one value. (Non-scalar subqueries, which return multiple values, are quite legal, but not for the above.)

Subquery (4) Acknowledged Alerts

OK, now that you have a handle on the correlated scalar subqueries above, those that fill cells in a set, a set that is defined by the outer query, let's look at a subquery that can be used to constrain the outer query. We do not really want all Alerts (above); we want un-Acknowledged Alerts: the identifiers that exist in Alert and do not exist in Acknowledgement. That is not filling cells; it is changing the content of the outer set. Of course, that means changing the WHERE clause.

We are not changing the structure of the outer set, so there is no change to the FROM and existing WHERE clauses.

Simply add a WHERE condition to exclude the set of Acknowledged Alerts. 1::1 cardinality, straight correlated join.

The ▶ SQL code ◀ required is on page 5.

The difference is that this is a non-scalar subquery, producing a set of rows (one column). We have an entire set of Alerts (the outer set) matched against an entire set of Acknowledgements.

The matching is processed because we have told the engine that the subquery is correlated, by using an alias (no cumbersome joins need to be spelled out).
Use 1, because we are performing an existence check. Visualise it as a column added onto the Alert set defined by the outer query.
Never use *, because we do not need the entire set of columns, and that will be slower.
Likewise, without a correlation, a WHERE NOT IN () is required, but again, that constructs the defined column set and then compares the two sets. Much slower.
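
A small sketch of that existence check, once more on hypothetical cut-down tables (single SensorId key, an invented AcknowledgedDtm column, invented rows) and on SQLite via Python's sqlite3 so it runs as is:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables and data for illustration only.
conn.executescript(
    """
    CREATE TABLE Alert (
        SensorId   INT  NOT NULL,
        ReadingDtm TEXT NOT NULL,
        PRIMARY KEY (SensorId, ReadingDtm)
    );
    CREATE TABLE Acknowledgement (
        SensorId        INT  NOT NULL,
        ReadingDtm      TEXT NOT NULL,
        AcknowledgedDtm TEXT NOT NULL,
        PRIMARY KEY (SensorId, ReadingDtm)
    );
    INSERT INTO Alert VALUES (1, '2011-01-13 10:10:00');
    INSERT INTO Alert VALUES (1, '2011-01-13 10:50:00');
    INSERT INTO Acknowledgement
        VALUES (1, '2011-01-13 10:10:00', '2011-01-13 10:15:00');
    """
)

# The subquery constrains the outer set instead of filling a cell: an Alert
# survives only if no Acknowledgement row matches it on the key.
unacknowledged = conn.execute(
    """
    SELECT a.SensorId, a.ReadingDtm
    FROM   Alert AS a
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   Acknowledgement AS k
               WHERE  k.SensorId   = a.SensorId
               AND    k.ReadingDtm = a.ReadingDtm
           )
    ORDER  BY a.ReadingDtm
    """
).fetchall()

print(unacknowledged)
```

Swapping Acknowledgement for an Action table in the same pattern would give the un-actioned variant.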

Subquery (5) Actioned Alerts

As an alternative constraint on the outer query, for un-actioned Alerts, instead of (4), exclude the set of Actioned Alerts. Straight correlated join.

The ▶ SQL code ◀ required is on page 5.

This code has been tested on Sybase ASE 15.0.3 using 1000 Alerts and 200 Acknowledgements, in different combinations, plus the Readings and Alerts identified in the document. Execution time is zero milliseconds (0.003 second resolution) for all executions.

If you need it, here is the ▶ SQL code in text format ◀.

Response to Comments

(6) ▶ Register Alert from Reading ◀
This code executes in a loop (provided), selecting new Readings that are out of range and creating Alerts, except where applicable Alerts already exist.

(7) ▶ Load Alert from Reading ◀
Given that you have a full set of test data for Reading, this code uses a modified form of (6) to load the applicable Alerts.

Common Problem

It is "simple" when you know how. I repeat, writing SQL without the ability to write subqueries is very limiting; it is essential for handling relational databases, which is what SQL was designed for.

Half the reason developers implement un-normalised data heaps (massive data duplication) is that they cannot write the subqueries required for normalised structures
- it is not that they have "denormalised for performance"; it is that they cannot code for Normalised. I have seen it a hundred times.
- Case in point here: you have a fully normalised relational database, the difficulty is coding for it, and you were contemplating duplicating tables for processing purposes.
And that is not counting the added complexity of a temporal database, or that of a 5NF temporal database.
Normalised means Never Duplicate Anything, more recently known as Don't Repeat Yourself.
Master subqueries and you will be in the 98th percentile: normalised, true relational databases; zero data duplication; very high performance.

I think you can figure out the remaining queries.

Relational Identifiers

Note that this example also happens to demonstrate the power of using relational identifiers, in that several tables in between the ones we want do not have to be joined (yes, the truth is that relational identifiers mean fewer, not more, joins than Id keys). Simply follow the solid lines.

Your temporal requirement demands keys containing DateTime. Imagine trying to code the above with Id PKs: there would be two levels of processing, one for the joins (and there would be far more of them) and another for the data processing.

Labels

I try to stay away from colloquial labels ("nested", "inner", etc.) because they are not specific, and stick to specific technical terms. For completeness and understanding:

A subquery after the FROM clause is a Materialised View: a result set derived in one query and then fed into the FROM clause of another query, as a "table".
- Oracle types call this an inline view.
- In most cases you can write correlated subqueries as materialised views, but that is massively more I/O and processing (since Oracle's subquery handling is abysmal, for Oracle only, materialised views are "faster").
A subquery in the WHERE clause is a Predicate Subquery, because it changes the content of the result set (that which it is predicated upon). It can return either a scalar (one value) or a non-scalar (many values).
- for scalars, use WHERE column =, or any scalar operator
- for non-scalars, use WHERE [NOT] EXISTS or WHERE column [NOT] IN

A subquery in the WHERE clause does not need to be correlated; the following works just fine. Identify all superfluous appendages:

    SELECT [Never] = FirstName,
           [Acted] = LastName
    FROM   User
    WHERE  UserId NOT IN (
               SELECT DISTINCT UserId
               FROM   Action
           )
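
For anyone who wants to run that non-correlated predicate subquery, here is a rough sketch on SQLite via Python's sqlite3. The table layouts and rows are invented for the example, the table names are quoted to avoid any clash with keywords, and the "[Never] = FirstName" heading syntax is replaced with AS because SQLite would read it as a comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layouts for the User and Action tables named in the query above.
conn.executescript(
    """
    CREATE TABLE "User" (
        UserId    INT  PRIMARY KEY,
        FirstName TEXT NOT NULL,
        LastName  TEXT NOT NULL
    );
    CREATE TABLE "Action" (
        UserId    INT  NOT NULL,
        ActionDtm TEXT NOT NULL,
        PRIMARY KEY (UserId, ActionDtm)
    );
    INSERT INTO "User" VALUES (1, 'Ann', 'Lee');
    INSERT INTO "User" VALUES (2, 'Bob', 'Ray');
    INSERT INTO "User" VALUES (3, 'Cy',  'Fox');
    INSERT INTO "Action" VALUES (1, '2011-01-13 10:00:00');
    INSERT INTO "Action" VALUES (1, '2011-01-13 11:00:00');
    INSERT INTO "Action" VALUES (3, '2011-01-13 10:30:00');
    """
)

# Non-correlated, non-scalar predicate subquery: the inner SELECT is evaluated
# without reference to the outer row and yields a set of UserIds.
never_acted = conn.execute(
    """
    SELECT FirstName AS Never,
           LastName  AS Acted
    FROM   "User"
    WHERE  UserId NOT IN (SELECT DISTINCT UserId FROM "Action")
    ORDER  BY UserId
    """
).fetchall()

print(never_acted)
```

Note that NOT IN behaves unexpectedly if the inner set can contain NULLs; here Action.UserId is declared NOT NULL, so that case does not arise.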
