Consider only the earliest entry for each id in a GROUP BY

Question

Background

I have a 4-column SQL table:

id - varchar (50)
g1 - varchar (50)
g2 - varchar (50)
datetime - timestamp

I have this query:

    SELECT g1,
           COUNT(DISTINCT id),
           SUM(COUNT(DISTINCT id)) OVER () AS total,
           (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
    FROM my_table
    WHERE g2 = 'start'
    GROUP BY 1
    ORDER BY share DESC

This query was built to answer: What is the distribution of g1 values among users?

Problem

Each id can have several records in the table. I want to consider only the earliest of them, where "earliest" means the minimum datetime value.

Example

Table

    id  g1  g2     datetime
    x1  a   start  2016-01-19 21:01:22
    x1  c   start  2016-01-19 21:01:21
    x2  b   start  2016-01-19 09:03:42
    x1  a   start  2016-01-18 13:56:45

Actual query results

    g1  count  total  share
    a   2      4      0.5
    b   1      4      0.25
    c   1      4      0.25

We have 4 entries, but I only want to consider two of them:

    x2  b  start  2016-01-19 09:03:42
    x1  a  start  2016-01-18 13:56:45

which are the earliest entries for each id.

Expected Query Results

    g1  count  total  share
    a   1      2      0.5
    b   1      2      0.5

Question

How can I consider only the earliest entry for each id in the GROUP BY?

+6

sql group-by

idanshmu 20 sept '16 at 5:33

4 answers

I don’t know what your DBMS is, so here is the standard ANSI way to do this.

    SELECT T1.g1,
           COUNT(DISTINCT T1.id),
           SUM(COUNT(DISTINCT T1.id)) OVER () AS total,
           (CAST(COUNT(DISTINCT T1.id) AS float) / SUM(COUNT(DISTINCT T1.id)) OVER ()) AS share
    FROM my_table T1
    INNER JOIN (
        SELECT id, MIN(datetime) AS mindt
        FROM my_table
        GROUP BY id
    ) T2
        ON T1.datetime = T2.mindt AND T1.id = T2.id AND T1.g2 = 'start'
    GROUP BY 1
    ORDER BY share DESC

This can be slow if you have a large table and datetime is not indexed.
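As a rough illustration of the indexing remark, the sketch below loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module, adds a composite index on (id, datetime), and runs the join against the per-id minimum. SQLite, the index name idx_my_table_id_datetime, and storing the timestamps as ISO-formatted text are all assumptions made for this demo; index syntax and how the planner uses the index differ between DBMSs.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT);
    INSERT INTO my_table VALUES
        ('x1', 'a', 'start', '2016-01-19 21:01:22'),
        ('x1', 'c', 'start', '2016-01-19 21:01:21'),
        ('x2', 'b', 'start', '2016-01-19 09:03:42'),
        ('x1', 'a', 'start', '2016-01-18 13:56:45');
    -- Hypothetical index name; it covers both the per-id MIN(datetime)
    -- lookup and the join condition on (id, datetime).
    CREATE INDEX idx_my_table_id_datetime ON my_table (id, datetime);
""")

# Keep only the row whose datetime equals the minimum for its id.
join_rows = conn.execute("""
    SELECT t1.id, t1.g1, t1.datetime
    FROM my_table t1
    INNER JOIN (
        SELECT id, MIN(datetime) AS mindt
        FROM my_table
        GROUP BY id
    ) t2
        ON t1.id = t2.id AND t1.datetime = t2.mindt
    WHERE t1.g2 = 'start'
    ORDER BY t1.id
""").fetchall()

for row in join_rows:
    print(row)
```

ISO-formatted text timestamps sort in chronological order, which is why MIN(datetime) works on a TEXT column here; with a real timestamp type that detail does not matter.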

+2

Thomas g 20 sept '16 at 5:43

Here is a solution that should work in SQL Server and any other database that supports CTEs:

    WITH cte AS (
        SELECT t1.g1, COUNT(*) AS cnt
        FROM yourTable t1
        INNER JOIN (
            SELECT id, MIN(datetime) AS datetime
            FROM yourTable
            GROUP BY id
        ) t2
            ON t1.id = t2.id AND t1.datetime = t2.datetime
        WHERE t1.g2 = 'start'  -- filter from the question
        GROUP BY t1.g1         -- one row per g1 value
    )
    SELECT t.g1,
           t.cnt AS count,
           (SELECT SUM(cnt) FROM cte) AS total,
           1.0 * t.cnt / (SELECT SUM(cnt) FROM cte) AS share  -- 1.0 avoids integer division
    FROM cte t

+2

Tim biegeleisen 20 sept '16 at 5:44

Try the query below.

    ;WITH cte_1 AS (
        SELECT id, MIN(datetime) AS [Date]
        FROM YourTable
        GROUP BY id
    )
    SELECT yt.g1,
           COUNT(DISTINCT yt.id) [Count],
           SUM(COUNT(DISTINCT yt.id)) OVER () AS total,
           (CAST(COUNT(DISTINCT yt.id) AS float) / SUM(COUNT(DISTINCT yt.id)) OVER ()) AS share
    FROM cte_1 c
    JOIN YourTable yt
        ON yt.[datetime] = c.[Date] AND yt.id = c.id AND yt.g2 = 'start'
    GROUP BY yt.g1
    ORDER BY share DESC

+2

Unnikrishnan r 20 sept '16 at 5:58

Elyasin · Accepted Answer · 2016-09-20T05:54:50+0000

You query all rows from my_table, although you only want the row with the earliest date for each id. I assume id is the primary key in the table.

I suggest you define a view (or an inline view) that selects only the earliest row for each id, and run your query against that view instead of my_table.

The view can be defined as follows and will contain only the row with the earliest date for each id:

    SELECT *
    FROM my_table a
    WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE a.id = z.id)
      AND a.g2 = 'start'

You can define this as a view or use it directly as an inline view, as in:

    SELECT g1,
           COUNT(DISTINCT id),
           SUM(COUNT(DISTINCT id)) OVER () AS total,
           (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
    FROM (
        SELECT a.id, a.g1, a.g2, a.datetime
        FROM my_table a
        WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE a.id = z.id)
          AND a.g2 = 'start'
    ) e  -- many DBMSs require an alias on a derived table
    GROUP BY 1
    ORDER BY share DESC
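To check the view-based approach against the sample data from the question, here is a minimal sketch that uses Python's sqlite3 module with an in-memory database. SQLite as the engine, the view name earliest_start, and timestamps stored as ISO-formatted text are assumptions for the demo, not part of the answer. The per-g1 counts are computed in a subquery first, so the window function only has to sum a plain column.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT);
    INSERT INTO my_table VALUES
        ('x1', 'a', 'start', '2016-01-19 21:01:22'),
        ('x1', 'c', 'start', '2016-01-19 21:01:21'),
        ('x2', 'b', 'start', '2016-01-19 09:03:42'),
        ('x1', 'a', 'start', '2016-01-18 13:56:45');

    -- Hypothetical view name: keeps only the earliest row of each id,
    -- and only if that row has g2 = 'start'.
    CREATE VIEW earliest_start AS
    SELECT a.id, a.g1, a.g2, a.datetime
    FROM my_table a
    WHERE a.datetime = (SELECT MIN(z.datetime) FROM my_table z WHERE z.id = a.id)
      AND a.g2 = 'start';
""")

# Count ids per g1 on the view, then derive the total and each share.
shares = db.execute("""
    SELECT g1,
           cnt,
           SUM(cnt) OVER () AS total,
           CAST(cnt AS REAL) / SUM(cnt) OVER () AS share
    FROM (
        SELECT g1, COUNT(DISTINCT id) AS cnt
        FROM earliest_start
        GROUP BY g1
    )
    ORDER BY share DESC, g1
""").fetchall()

for row in shares:
    print(row)
```

On the sample data this yields one row for a and one for b, each with a share of 0.5, which matches the expected results in the question. Window functions need SQLite 3.25 or newer.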
