Consider only the earliest entry for each id in a GROUP BY

Background

I have a 4-column SQL table:

  • id - varchar (50)
  • g1 - varchar (50)
  • g2 - varchar (50)
  • datetime - timestamp

I have this query:

    SELECT g1,
           COUNT(DISTINCT id),
           SUM(COUNT(DISTINCT id)) OVER () AS total,
           (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
    FROM my_table
    WHERE g2 = 'start'
    GROUP BY 1
    ORDER BY share DESC

This query was built to answer: What is the distribution of g1 values among users?

Problem

Each id can have several records in the table. I want to consider only the earliest of them, where "earliest" means the record with the minimum datetime value.

Example

Table

    id  g1  g2     datetime
    x1  a   start  2016-01-19 21:01:22
    x1  c   start  2016-01-19 21:01:21
    x2  b   start  2016-01-19 09:03:42
    x1  a   start  2016-01-18 13:56:45

Actual query results

    g1  count  total  share
    a   2      4      0.5
    b   1      4      0.25
    c   1      4      0.25

We have 4 entries, but I only want to consider two of them:

    x2  b  start  2016-01-19 09:03:42
    x1  a  start  2016-01-18 13:56:45

These are the earliest entries for each id.

Expected Query Results

    g1  count  total  share
    a   1      2      0.5
    b   1      2      0.5

Question

How can I consider only the earliest entry for each id in the GROUP BY?

4 answers

You query all the data in my_table, although you only want the row with the earliest date for each id. I assume id is the primary key in the table.

I suggest you define a view (or inline view) that returns only the earliest row for each id, and run your query against that view instead of my_table.

A view can be defined as follows and will contain only the earliest row for each id:

    SELECT *
    FROM my_table a
    WHERE a.datetime = (SELECT MIN(z.datetime)
                        FROM my_table z
                        WHERE a.id = z.id)
      AND a.g2 = 'start'

You can define this as a view or use it inline as a subquery, as in:

    SELECT g1,
           COUNT(DISTINCT id),
           SUM(COUNT(DISTINCT id)) OVER () AS total,
           (CAST(COUNT(DISTINCT id) AS float) / SUM(COUNT(DISTINCT id)) OVER ()) AS share
    FROM (SELECT a.id, a.g1, a.g2, a.datetime
          FROM my_table a
          WHERE a.datetime = (SELECT MIN(z.datetime)
                              FROM my_table z
                              WHERE a.id = z.id)
            AND a.g2 = 'start') t  -- many DBMSs require an alias on a derived table
    GROUP BY 1
    ORDER BY share DESC

I don’t know what your DBMS is, so here is the standard ANSI way to do this.

    SELECT T1.g1,
           COUNT(DISTINCT T1.id),
           SUM(COUNT(DISTINCT T1.id)) OVER () AS total,
           (CAST(COUNT(DISTINCT T1.id) AS float) / SUM(COUNT(DISTINCT T1.id)) OVER ()) AS share
    FROM my_table T1
    INNER JOIN (SELECT id, MIN(datetime) AS mindt
                FROM my_table
                GROUP BY id) T2
        ON T1.datetime = T2.mindt
       AND T1.id = T2.id
       AND T1.g2 = 'start'
    GROUP BY 1
    ORDER BY share DESC

This can be slow if the table is large and datetime is not indexed.
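
To check the join-based approach against the sample data from the question, here is a minimal sketch that loads the four example rows into an in-memory SQLite database through Python's sqlite3 module and runs the query. SQLite is only an assumption for the sake of a runnable example (the question does not name a DBMS, and window functions need SQLite 3.25 or later). The timestamps are stored as ISO-formatted text, the `cnt` alias and the extra `T1.g1` tie-break in ORDER BY are additions for illustration, and the g2 filter is written as a WHERE clause, which is equivalent for an inner join.

```python
import sqlite3

# In-memory SQLite database holding the sample rows from the question.
# Timestamps are stored as ISO text, which sorts chronologically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id TEXT, g1 TEXT, g2 TEXT, datetime TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        ("x1", "a", "start", "2016-01-19 21:01:22"),
        ("x1", "c", "start", "2016-01-19 21:01:21"),
        ("x2", "b", "start", "2016-01-19 09:03:42"),
        ("x1", "a", "start", "2016-01-18 13:56:45"),
    ],
)

# Join every row to the earliest datetime of its id, so only the
# earliest row per id survives, then aggregate per g1.
query = """
SELECT T1.g1,
       COUNT(DISTINCT T1.id) AS cnt,
       SUM(COUNT(DISTINCT T1.id)) OVER () AS total,
       CAST(COUNT(DISTINCT T1.id) AS float)
           / SUM(COUNT(DISTINCT T1.id)) OVER () AS share
FROM my_table T1
INNER JOIN (SELECT id, MIN(datetime) AS mindt
            FROM my_table
            GROUP BY id) T2
    ON T1.datetime = T2.mindt AND T1.id = T2.id
WHERE T1.g2 = 'start'
GROUP BY T1.g1
ORDER BY share DESC, T1.g1
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data this should return one row for a and one for b, each with a count of 1, a total of 2 and a share of 0.5, which matches the expected results in the question.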


Here is a solution that should work in SQL Server and any other database that supports CTEs:

    WITH cte AS (
        SELECT t1.g1, COUNT(*) AS count
        FROM yourTable t1
        INNER JOIN (
            SELECT id, MIN(datetime) AS datetime
            FROM yourTable
            GROUP BY id
        ) t2
            ON t1.id = t2.id
           AND t1.datetime = t2.datetime
        GROUP BY t1.g1  -- needed because g1 is selected next to COUNT(*)
    )
    SELECT t.g1,
           t.count,
           (SELECT SUM(count) FROM cte) AS total,  -- total entries, not number of groups
           CAST(t.count AS float) / (SELECT SUM(count) FROM cte) AS share  -- avoid integer division
    FROM cte t

Try the query below.

    ;WITH cte_1 AS (
        SELECT id, MIN(datetime) AS [Date]
        FROM YourTable
        GROUP BY id
    )
    SELECT yt.g1,
           COUNT(DISTINCT yt.id) [Count],
           SUM(COUNT(DISTINCT yt.id)) OVER () AS total,
           (CAST(COUNT(DISTINCT yt.id) AS float) / SUM(COUNT(DISTINCT yt.id)) OVER ()) AS share
    FROM cte_1 c
    JOIN YourTable yt
        ON yt.[datetime] = c.[Date]
       AND yt.id = c.id
       AND yt.g2 = 'start'
    GROUP BY yt.g1
    ORDER BY share DESC
