SQL ranking query to calculate rank and median in subgroups

I want to compute the median of y within subgroups of this simple xy_table:

      x  | y   --groups-->   gid |  x  | y   --medians-->   gid |  x  | y
     -------                 -------------                  -------------
     0.1 | 4                 0.0 | 0.1 | 4                  0.0 | 0.1 | 4
     0.2 | 3                 0.0 | 0.2 | 3                      |     |
     0.7 | 5                 1.0 | 0.7 | 5                  1.0 | 0.7 | 5
     1.5 | 1                 2.0 | 1.5 | 1                      |     |
     1.9 | 6                 2.0 | 1.9 | 6                      |     |
     2.1 | 5                 2.0 | 2.1 | 5                  2.0 | 2.1 | 5
     2.7 | 1                 3.0 | 2.7 | 1                  3.0 | 2.7 | 1

In this example, each x is unique and the table is already sorted by x. Now I want to GROUP BY round(x) and get the tuple that contains the median y in each group.

I can already calculate the median for the entire table with this ranking query:

    SELECT a.x, a.y
    FROM xy_table a, xy_table b
    WHERE a.y >= b.y
    GROUP BY a.x, a.y
    HAVING count(*) = (SELECT round((count(*)+1)/2) FROM xy_table)

Output: 0.1, 4.0

But I have not yet managed to write a query to calculate the median for subgroups.

Note: I do not have a median() aggregate function. Also, please do not offer solutions that use PARTITION, RANK, or QUANTILE (as in similar questions, whose answers are too vendor-specific, e.g. https://stackoverflow.com/a/4649/ ). I need plain SQL (i.e. SQLite-compatible, without a median() function).

Edit: I was really looking for the medoid, not the median.

sql sqlite group-by median ranking
2 answers

I suggest doing calculations in your programming language:

    for each group:
        for each record_in_group:
            append y to array
        median of array
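As a rough illustration of the loop above, here is a minimal Python sketch using the standard sqlite3 and statistics modules. The in-memory database, the helper name medians_by_group, and the choice of statistics.median_high (which returns the larger of the two middle values for an even-sized group, matching the expected output in the question) are assumptions made for this example, not part of the original answer.

```python
import sqlite3
import statistics
from collections import defaultdict

# Sample data copied from the question's xy_table.
ROWS = [(0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1), (1.9, 6), (2.1, 5), (2.7, 1)]


def medians_by_group(conn):
    """Return {gid: upper median of y}, where gid is SQLite's round(x)."""
    groups = defaultdict(list)
    # Let SQLite do the rounding so gid matches what GROUP BY round(x) would use.
    for gid, y in conn.execute("SELECT round(x), y FROM xy_table"):
        groups[gid].append(y)
    # median_high picks the larger middle value when a group has an even size.
    return {gid: statistics.median_high(ys) for gid, ys in sorted(groups.items())}


conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.execute("CREATE TABLE xy_table (x REAL, y REAL)")
conn.executemany("INSERT INTO xy_table VALUES (?, ?)", ROWS)

result = medians_by_group(conn)
print(result)
```

If the conventional median (mean of the two middle values) is wanted instead, statistics.median could be swapped in for statistics.median_high.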

But if you are stuck with SQLite, you can order each group by y and select the record in the middle, as shown at http://sqlfiddle.com/#!5/d4c68/55/0 :

UPDATE: only the larger of the two middle values matters when a group has an even number of rows, so avg() is not required:

    select groups.gid, ids.y median
    from (
      -- get middle row number in each group (bigger number if even nr. of rows)
      -- note the integer division
      select round(x) gid, count(*) / 2 + 1 mid_row_right
      from xy_table
      group by round(x)
    ) groups
    join (
      -- for each record get equivalent of
      -- row_number() over(partition by gid order by y)
      select round(a.x) gid, a.x, a.y, count(*) rownr_by_y
      from xy_table a
      left join xy_table b
        on round(a.x) = round(b.x) and a.y >= b.y
      group by a.x
    ) ids on ids.gid = groups.gid
    where ids.rownr_by_y = groups.mid_row_right
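One caveat: ranking rows with `a.y >= b.y` gives equal y values within a group the same row number, so the middle rank can be duplicated or skipped when a group contains ties. The following is a minimal sketch of one possible workaround, not the answer's own query: it breaks ties on x, assuming x is unique as the question states. The aliases grp and ids, the helper group_medians, and the Python sqlite3 wrapper are illustrative choices so the example can be run as-is.

```python
import sqlite3

# Variant of the self-join ranking that breaks ties in y by x (assumes unique x).
TIE_SAFE_QUERY = """
SELECT grp.gid, ids.y AS median
FROM (
    -- middle row number per group (upper middle for even-sized groups)
    SELECT round(x) AS gid, count(*) / 2 + 1 AS mid_row
    FROM xy_table
    GROUP BY round(x)
) AS grp
JOIN (
    -- row number of each record within its group, ordered by (y, x)
    SELECT round(a.x) AS gid, a.y AS y, count(*) AS rownr
    FROM xy_table AS a
    JOIN xy_table AS b
      ON round(a.x) = round(b.x)
     AND (b.y < a.y OR (b.y = a.y AND b.x <= a.x))
    GROUP BY a.x, a.y
) AS ids ON ids.gid = grp.gid AND ids.rownr = grp.mid_row
ORDER BY grp.gid
"""


def group_medians(rows):
    """Load rows into a fresh in-memory xy_table and return (gid, median) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE xy_table (x REAL, y REAL)")
    conn.executemany("INSERT INTO xy_table VALUES (?, ?)", rows)
    return conn.execute(TIE_SAFE_QUERY).fetchall()


sample = [(0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1), (1.9, 6), (2.1, 5), (2.7, 1)]
print(group_medians(sample))
# A group with tied y values still yields exactly one row.
print(group_medians([(0.1, 1), (0.2, 3), (0.3, 3)]))
```

On the question's sample data, which has no ties inside any group, this variant should return the same medians as the query above.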

OK, this alternative relies on a temporary table:

    create temporary table tmp (x float, y float);

    insert into tmp
    select * from xy_table
    order by round(x), y

You could create it for just the range of data you are interested in. Another way would be to ensure that xy_table itself is stored in this sort order, rather than being ordered by x alone. The reason for all this is that SQLite lacks row-numbering capabilities.

Then:

    select tmp4.x as gid, t.*
    from (
      select tmp1.x,
             round((tmp2.y + coalesce(tmp3.y, tmp2.y)) / 2) as y
             -- <- for larger of the two, change to:
             -- (case when tmp2.y > coalesce(tmp3.y, 0) then tmp2.y else tmp3.y end)
      from (
        select round(x) as x,
               min(rowid) + (count(*) / 2) as id1,
               (case when count(*) % 2 = 0
                     then min(rowid) + (count(*) / 2) - 1
                     else 0 end) as id2
        from (
          select *, rowid from tmp
        ) t
        group by round(x)
      ) tmp1
      join tmp tmp2 on tmp1.id1 = tmp2.rowid
      left join tmp tmp3 on tmp1.id2 = tmp3.rowid
    ) tmp4
    join xy_table t on tmp4.x = round(t.x) and tmp4.y = t.y

If you want to treat the median as the larger of the two middle values, which does not match the usual definition, as @Aprillion already pointed out, then simply take the larger of the two y values, rather than their average, in the y expression of the inner select.
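For the "larger of the two" variant, the rowid arithmetic can be reduced to a single middle row per group. The sketch below is a simplified illustration under stated assumptions rather than the query above: it assumes SQLite assigns ascending rowids to a fresh table in the order the INSERT ... SELECT ... ORDER BY delivers rows, which is the behavior this answer depends on. The aliases g and t, the column name mid_id, and the Python sqlite3 wrapper are hypothetical names chosen for the example.

```python
import sqlite3

# Sample data copied from the question's xy_table.
ROWS = [(0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1), (1.9, 6), (2.1, 5), (2.7, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xy_table (x REAL, y REAL)")
conn.executemany("INSERT INTO xy_table VALUES (?, ?)", ROWS)

# Copy the rows into a fresh temp table sorted by (group, y).
# Assumption: rowids then increase in exactly this order.
conn.execute("CREATE TEMPORARY TABLE tmp (x FLOAT, y FLOAT)")
conn.execute("INSERT INTO tmp SELECT x, y FROM xy_table ORDER BY round(x), y")

# Each group occupies a contiguous rowid range, so its upper-middle row
# sits at min(rowid) + count(*) / 2 (integer division).
UPPER_MIDDLE = """
SELECT g.gid, t.x, t.y
FROM (
    SELECT round(x) AS gid, min(rowid) + count(*) / 2 AS mid_id
    FROM tmp
    GROUP BY round(x)
) AS g
JOIN tmp AS t ON t.rowid = g.mid_id
ORDER BY g.gid
"""

middle_rows = conn.execute(UPPER_MIDDLE).fetchall()
print(middle_rows)
```

Because the selected row always exists in the table, this form returns the full (x, y) tuple for each group without a second join back to xy_table.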
