SQL random sampling with groups

I have a database of university graduates, and I would like to draw a random sample of about 1000 records from it.

I want the sample to represent the population, so I would like it to contain the courses in the same proportions as the full table, for example 50% from course code 1, 30% from course code 2 and 20% from course code 3.

I could do this using the following:

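    -- each branch is a derived table, because T-SQL does not allow ORDER BY directly inside a UNION branch
    select id from (select top 500 id from degree where coursecode = 1 order by newid()) c1
    union
    select id from (select top 300 id from degree where coursecode = 2 order by newid()) c2
    union
    select id from (select top 200 id from degree where coursecode = 3 order by newid()) c3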

but we have hundreds of course codes, so this would take a lot of time. I would also like to reuse the code for different sample sizes, and I especially do not want to go through the query and hard-code the sample sizes.

Any help would be greatly appreciated.

+5
4 answers

You need a stratified sample. I would recommend sorting the data by course code and taking every nth row. Here is one method that works best if you have a large population:

    -- assumes cnt is at least 1000, so that cnt / 500 (integer division) is 2 or more
    select d.*
    from (select d.*,
                 row_number() over (order by coursecode, newid()) as seqnum,
                 count(*) over () as cnt
          from degree d
         ) d
    where seqnum % (cnt / 500) = 1;
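To make the every-nth-row idea easier to follow outside the database, here is a minimal Python sketch of the same logic. The function name `systematic_sample` and the `(id, coursecode)` tuple layout are assumptions made for illustration, not part of the answer. Unlike the query, it keeps the first row of each block of `step` rows, so it still returns rows when the step works out to 1.

```python
import random

def systematic_sample(rows, sample_size, key, rng=random):
    """Sort rows by stratum (random order inside each stratum), keep every nth row."""
    if sample_size <= 0 or not rows:
        return []
    # Mirrors ORDER BY coursecode, newid(): random tiebreak within a course code.
    ordered = sorted(rows, key=lambda r: (key(r), rng.random()))
    # Integer division, like cnt / 500 in the query; never let the step drop to 0.
    step = max(len(ordered) // sample_size, 1)
    # Keep the first row of every block of `step` rows. If len(rows) is not a
    # multiple of sample_size, this can return slightly more than sample_size rows.
    return [r for i, r in enumerate(ordered) if i % step == 0]
```

Because the rows are sorted by course code before every nth row is taken, each course contributes roughly in proportion to its share of the table.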

EDIT:

You can also calculate the population size for each group on the fly:

    select d.*
    from (select d.*,
                 row_number() over (partition by coursecode order by newid()) as seqnum,
                 count(*) over () as cnt,
                 count(*) over (partition by coursecode) as cc_cnt
          from degree d
         ) d
    -- <= keeps the full quota for each course code; < would drop one row whenever the quota is a whole number
    where seqnum <= 500 * (cc_cnt * 1.0 / cnt)
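The per-group quota in this query is just the target sample size multiplied by the group's share of the table. A rough Python sketch of that arithmetic follows; `proportional_sample` and the `(id, coursecode)` tuples are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict

def proportional_sample(rows, sample_size, key, rng=random):
    """Draw from each group a number of rows proportional to the group's size."""
    groups = defaultdict(list)
    for r in rows:
        groups[key(r)].append(r)
    total = len(rows)
    sample = []
    for members in groups.values():
        # Quota = sample_size * (group count / total count), rounded down.
        quota = min(sample_size * len(members) // total, len(members))
        # rng.sample picks `quota` distinct rows, like taking the first
        # `quota` rows of a group ordered by newid().
        sample.extend(rng.sample(members, quota))
    return sample
```

Note that rounding down means a course whose quota is below 1 contributes no rows at all, so with hundreds of small course codes the total can come out under the requested size.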
+9

Add a table, population, that stores the sample size for each course code.

I think it should be like this:

    SELECT *
    FROM (SELECT id,
                 coursecode,
                 ROW_NUMBER() OVER (PARTITION BY coursecode ORDER BY NEWID()) AS rn
          FROM degree) t
    LEFT OUTER JOIN population p
        ON t.coursecode = p.coursecode
    WHERE rn <= p.SampleSize
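The answer does not say how the sample sizes in that table would be filled in. One possible way, shown here purely as an assumption, is a largest-remainder allocation, which rounds the proportional quotas so that they add up exactly to the requested total. The function name `allocate_sample_sizes` is made up for this sketch.

```python
from collections import Counter

def allocate_sample_sizes(course_codes, sample_size):
    """Return {coursecode: quota} with quotas proportional to group size
    and summing exactly to sample_size (capped at the population size)."""
    counts = Counter(course_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    sample_size = min(sample_size, total)
    # Whole part of each proportional quota, using integer arithmetic.
    sizes = {c: sample_size * n // total for c, n in counts.items()}
    # Fractional part of each quota, expressed as a numerator over `total`.
    remainders = {c: sample_size * n % total for c, n in counts.items()}
    leftover = sample_size - sum(sizes.values())
    # Hand the rows lost to rounding to the groups with the largest remainders.
    for c in sorted(counts, key=lambda c: remainders[c], reverse=True)[:leftover]:
        sizes[c] += 1
    return sizes
```

The resulting dictionary could then be loaded into the table as one row per course code.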
+1

There is no need to split the population at all.

If you take a sample of 1000 across hundreds of course codes, you should expect that many of the smaller course codes will not appear in the sample at all.

If the population is homogeneous (say, a continuous sequence of student identifiers), a uniformly distributed sample will automatically be weighted by how often each course code occurs. Since newid() is a uniform random sampler, this works out of the box.

The only wrinkle you may encounter is a student ID associated with several course codes. In that case, build a unique list (a temporary table or subquery) containing a sequential identifier, the student identifier and the course code, sample on the sequential identifier, and group by student identifier to remove duplicates.
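As a quick illustration of the claim that a plain uniform sample already follows the course proportions, here is a small Python sketch. The helper names and the `(id, coursecode)` tuples are assumptions for the example, and the match is only approximate because it depends on chance.

```python
import random
from collections import Counter

def simple_random_sample(rows, sample_size, rng=random):
    """Uniform sample without replacement, like TOP n ... ORDER BY newid()."""
    return rng.sample(rows, min(sample_size, len(rows)))

def proportions(rows, key):
    """Share of rows falling in each group."""
    counts = Counter(key(r) for r in rows)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

# Hypothetical population: 50% course 1, 30% course 2, 20% course 3.
population = ([(i, 1) for i in range(5000)]
              + [(i, 2) for i in range(5000, 8000)]
              + [(i, 3) for i in range(8000, 10000)])
sample = simple_random_sample(population, 1000, rng=random.Random(42))
sample_shares = proportions(sample, key=lambda r: r[1])
```

With 1000 rows drawn, each share in `sample_shares` should land within a few percentage points of the population share, though very small course codes can still be missed entirely.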

+1

I made similar queries (but not in MS SQL) using the ROW_NUMBER approach:

    select ...
    from (select ...,
                 row_number() over (partition by coursecode order by newid()) as rn
          from degree
         ) as d
    -- sample_size: a table holding one samplesize value per coursecode
    join sample_size as s
        on d.coursecode = s.coursecode
       and d.rn <= s.samplesize
0
