Efficiently find top-N values from multiple columns independently in Oracle

Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the N most frequent values for each column independently, with the most elegant SQL possible. For example, if I have

    FirstName LastName FavoriteAnimal FavoriteBook
    --------- -------- -------------- ------------
    Ferris    Freemont Possum         Ubik
    Nancy     Freemont Lemur          Housekeeping
    Nancy     Drew     Penguin        Ubik
    Bill      Ribbits  Lemur          Dhalgren

and I want the top 1, then the result would be:

    FirstName LastName FavoriteAnimal FavoriteBook
    --------- -------- -------------- ------------
    Nancy     Freemont Lemur          Ubik

I can think of ways to do this, but I'm not sure how optimal they are, which matters when there are 30 billion rows; and the SQL can get large and ugly, and may use too much temporary space.

Using Oracle.

+4
6 answers

This should only need one pass through the table. You can use the analytic version of count() to get the frequency of each value independently:

    select firstname, count(*) over (partition by firstname) as c_fn,
           lastname, count(*) over (partition by lastname) as c_ln,
           favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
           favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table;

    FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
    ------ ---- -------- ---- ------- ---- ------------ ----
    Bill      1 Ribbits     1 Lemur      2 Dhalgren        1
    Ferris    1 Freemont    2 Possum     1 Ubik            2
    Nancy     2 Freemont    2 Lemur      2 Housekeeping    1
    Nancy     2 Drew        1 Penguin    1 Ubik            2

Then you can use that as a CTE (subquery factoring, in Oracle terminology) and pull just the highest-frequency value for each column:

    with tmp_tab as (
        select /*+ MATERIALIZE */
               firstname, count(*) over (partition by firstname) as c_fn,
               lastname, count(*) over (partition by lastname) as c_ln,
               favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
               favoritebook, count(*) over (partition by favoritebook) as c_fb
        from my_table
    )
    select
        (select firstname from (
            select firstname,
                   row_number() over (partition by null order by c_fn desc) as r_fn
            from tmp_tab
        ) where r_fn = 1) as firstname,
        (select lastname from (
            select lastname,
                   row_number() over (partition by null order by c_ln desc) as r_ln
            from tmp_tab
        ) where r_ln = 1) as lastname,
        (select favoriteanimal from (
            select favoriteanimal,
                   row_number() over (partition by null order by c_fa desc) as r_fa
            from tmp_tab
        ) where r_fa = 1) as favoriteanimal,
        (select favoritebook from (
            select favoritebook,
                   row_number() over (partition by null order by c_fb desc) as r_fb
            from tmp_tab
        ) where r_fb = 1) as favoritebook
    from dual;

    FIRSTN LASTNAME FAVORIT FAVORITEBOOK
    ------ -------- ------- ------------
    Nancy  Freemont Lemur   Ubik

You make one pass over the CTE for each column, but the actual table should still only be hit once (thanks to the materialize hint). And you can extend the order by clauses to control what happens when there are ties.
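For example, here is a minimal sketch of one possible tie-breaker (reusing the my_table name from above), preferring the alphabetically first value among equally frequent first names:

    -- Sketch: break frequency ties alphabetically for one column.
    with tmp_tab as (
        select firstname, count(*) over (partition by firstname) as c_fn
        from my_table
    )
    select firstname from (
        select firstname,
               row_number() over (order by c_fn desc, firstname) as r_fn
        from tmp_tab
    )
    where r_fn = 1;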

This is similar in concept to what Thilo, ysth and others have suggested, except that it lets Oracle keep track of all the counts for you.

Edit: Hmm, explain plan shows it does four full table scans; I might have to think about this a bit more... Edit 2: Adding the (undocumented) materialize hint to the CTE seems to resolve that; it creates a transient temporary table to hold the results and does only one full table scan. The explain plan cost is higher, though, at least with this small sample data set. I'd be interested in any comments about drawbacks of doing this.
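If you want to verify the scan count yourself, here is a quick sketch using a cut-down version of the query (explain plan and DBMS_XPLAN are standard Oracle tools; my_table is the sample table above):

    -- Sketch: generate and display the plan to confirm how often
    -- my_table is actually scanned.
    explain plan for
    select firstname, count(*) over (partition by firstname) as c_fn
    from my_table;

    select * from table(dbms_xplan.display);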

+5

The best I have come up with so far in pure Oracle SQL is similar to what @AlexPoole did. I use count(A) rather than count(*), so that nulls (which get a count of zero) end up at the bottom.

    with NUM_ROWS_RETURNED as (
        select 4 as NUM from dual
    ), SAMPLE_DATA as (
        select /*+ materialize */ A, B, C, D, E
        from (
            select 1 as A, 1 as B, 4 as C, 1 as D, 4 as E from dual union all
            select 1   , -2  , 3 , 2   , 3 from dual union all
            select 1   , -2  , 2 , 2   , 3 from dual union all
            select null,  1  , 1 , 3   , 2 from dual union all
            select null,  2  , 4 , null, 2 from dual union all
            select null,  1  , 3 , null, 2 from dual union all
            select null,  1  , 2 , null, 1 from dual union all
            select null,  1  , 4 , null, 1 from dual union all
            select null,  1  , 3 , 3   , 1 from dual union all
            select null,  1  , 4 , 3   , 1 from dual
        )
    ), RANKS as (
        select /*+ materialize */ rownum as RANKED
        from SAMPLE_DATA
        where rownum <= (select min(NUM) from NUM_ROWS_RETURNED)
    )
    select r.RANKED,
           max(case when A_RANK = r.RANKED then A else null end) as A,
           max(case when B_RANK = r.RANKED then B else null end) as B,
           max(case when C_RANK = r.RANKED then C else null end) as C,
           max(case when D_RANK = r.RANKED then D else null end) as D,
           max(case when E_RANK = r.RANKED then E else null end) as E
    from (
        select A, dense_rank() over (order by A_COUNTS desc) as A_RANK,
               B, dense_rank() over (order by B_COUNTS desc) as B_RANK,
               C, dense_rank() over (order by C_COUNTS desc) as C_RANK,
               D, dense_rank() over (order by D_COUNTS desc) as D_RANK,
               E, dense_rank() over (order by E_COUNTS desc) as E_RANK
        from (
            select A, count(A) over (partition by A) as A_COUNTS,
                   B, count(B) over (partition by B) as B_COUNTS,
                   C, count(C) over (partition by C) as C_COUNTS,
                   D, count(D) over (partition by D) as D_COUNTS,
                   E, count(E) over (partition by E) as E_COUNTS
            from SAMPLE_DATA
        )
    )
    cross join RANKS r
    group by r.RANKED
    order by r.RANKED
    /

gives:

    RANKED|   A|   B|   C|   D|   E
    ------|----|----|----|----|----
         1|   1|   1|   4|   3|   1
         2|null|  -2|   3|   2|   2
         3|null|   2|   2|   1|   3
         4|null|null|   1|null|   4

with this plan:

    --------------------------------------------------------------------------------------------------
    | Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                    |              |     1 |    93 |    57  (20)| 00:00:01 |
    |   1 |  TEMP TABLE TRANSFORMATION          |              |       |       |            |          |
    |   2 |   LOAD AS SELECT                    |              |       |       |            |          |
    |   3 |    VIEW                             |              |    10 |   150 |    20   (0)| 00:00:01 |
    |   4 |     UNION-ALL                       |              |       |       |            |          |
    |   5 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |   6 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |   7 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |   8 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |   9 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |  10 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |  11 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |  12 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |  13 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |  14 |      FAST DUAL                      |              |     1 |       |     2   (0)| 00:00:01 |
    |  15 |   LOAD AS SELECT                    |              |       |       |            |          |
    |* 16 |    COUNT STOPKEY                    |              |       |       |            |          |
    |  17 |     VIEW                            |              |    10 |       |     2   (0)| 00:00:01 |
    |  18 |      TABLE ACCESS FULL              | SYS_TEMP_0FD9|    10 |   150 |     2   (0)| 00:00:01 |
    |  19 |    SORT AGGREGATE                   |              |     1 |       |            |          |
    |  20 |     FAST DUAL                       |              |     1 |       |     2   (0)| 00:00:01 |
    |  21 |   SORT GROUP BY                     |              |     1 |    93 |    33  (34)| 00:00:01 |
    |  22 |    MERGE JOIN CARTESIAN             |              |   100 |  9300 |    32  (32)| 00:00:01 |
    |  23 |     VIEW                            |              |    10 |   800 |    12  (84)| 00:00:01 |
    |  24 |      WINDOW SORT                    |              |    10 |   800 |    12  (84)| 00:00:01 |
    |  25 |       WINDOW SORT                   |              |    10 |   800 |    12  (84)| 00:00:01 |
    |  26 |        WINDOW SORT                  |              |    10 |   800 |    12  (84)| 00:00:01 |
    |  27 |         WINDOW SORT                 |              |    10 |   800 |    12  (84)| 00:00:01 |
    |  28 |          WINDOW SORT                |              |    10 |   800 |    12  (84)| 00:00:01 |
    |  29 |           VIEW                      |              |    10 |   800 |     7  (72)| 00:00:01 |
    |  30 |            WINDOW SORT              |              |    10 |   150 |     7  (72)| 00:00:01 |
    |  31 |             WINDOW SORT             |              |    10 |   150 |     7  (72)| 00:00:01 |
    |  32 |              WINDOW SORT            |              |    10 |   150 |     7  (72)| 00:00:01 |
    |  33 |               WINDOW SORT           |              |    10 |   150 |     7  (72)| 00:00:01 |
    |  34 |                WINDOW SORT          |              |    10 |   150 |     7  (72)| 00:00:01 |
    |  35 |                 VIEW                |              |    10 |   150 |     2   (0)| 00:00:01 |
    |  36 |                  TABLE ACCESS FULL  | SYS_TEMP_0FD9|    10 |   150 |     2   (0)| 00:00:01 |
    |  37 |     BUFFER SORT                     |              |    10 |   130 |    33  (34)| 00:00:01 |
    |  38 |      VIEW                           |              |    10 |   130 |     2   (0)| 00:00:01 |
    |  39 |       TABLE ACCESS FULL             | SYS_TEMP_0FD9|    10 |   130 |     2   (0)| 00:00:01 |
    --------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

      16 - filter( (SELECT MIN(4) FROM "SYS"."DUAL" "DUAL")>=ROWNUM)

But against one of the real tables it looks like this (for a slightly modified query):

    ----------------------------------------------------------------------------------------------------------------------------------------------------------
    | Id  | Operation                               | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
    ----------------------------------------------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                        |               |     1 |   422 |       |  6026M  (1)|999:59:59 |       |       |        |      |            |
    |   1 |  TEMP TABLE TRANSFORMATION              |               |       |       |       |            |          |       |       |        |      |            |
    |   2 |   LOAD AS SELECT                        |               |       |       |       |            |          |       |       |        |      |            |
    |*  3 |    COUNT STOPKEY                        |               |       |       |       |            |          |       |       |        |      |            |
    |   4 |     PX COORDINATOR                      |               |       |       |       |            |          |       |       |        |      |            |
    |   5 |      PX SEND QC (RANDOM)                | :TQ10000      |    10 |       |       |     2   (0)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
    |*  6 |       COUNT STOPKEY                     |               |       |       |       |            |          |       |       |  Q1,00 | PCWC |            |
    |   7 |        PX BLOCK ITERATOR                |               |    10 |       |       |     2   (0)| 00:00:01 |     1 |   115 |  Q1,00 | PCWC |            |
    |   8 |         INDEX FAST FULL SCAN            | IDX           |    10 |       |       |     2   (0)| 00:00:01 |     1 |   115 |  Q1,00 | PCWP |            |
    |   9 |   SORT GROUP BY                         |               |     1 |   422 |       |  6026M  (1)|999:59:59 |       |       |        |      |            |
    |  10 |    MERGE JOIN CARTESIAN                 |               |   22G | 8997G |       |  6024M  (1)|999:59:59 |       |       |        |      |            |
    |  11 |     VIEW                                |               | 2289M |  872G |       |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  12 |      WINDOW SORT                        |               | 2289M |  872G |  970G |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  13 |       WINDOW SORT                       |               | 2289M |  872G |  970G |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  14 |        WINDOW SORT                      |               | 2289M |  872G |  970G |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  15 |         WINDOW SORT                     |               | 2289M |  872G |  970G |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  16 |          WINDOW SORT                    |               | 2289M |  872G |  970G |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  17 |           WINDOW SORT                   |               | 2289M |  872G |  970G |  1443M  (1)|999:59:59 |       |       |        |      |            |
    |  18 |            VIEW                         |               | 2289M |  872G |       |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  19 |             WINDOW SORT                 |               | 2289M |  162G |  198G |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  20 |              WINDOW SORT                |               | 2289M |  162G |  198G |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  21 |               WINDOW SORT               |               | 2289M |  162G |  198G |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  22 |                WINDOW SORT              |               | 2289M |  162G |  198G |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  23 |                 WINDOW SORT             |               | 2289M |  162G |  198G |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  24 |                  WINDOW SORT            |               | 2289M |  162G |  198G |   248M  (1)|829:16:06 |       |       |        |      |            |
    |  25 |                   PARTITION RANGE ALL   |               | 2289M |  162G |       |  3587K  (4)| 11:57:36 |     1 |   115 |        |      |            |
    |  26 |                    TABLE ACCESS FULL    | LARGE_TABLE   | 2289M |  162G |       |  3587K  (4)| 11:57:36 |     1 |   115 |        |      |            |
    |  27 |     BUFFER SORT                         |               |    10 |   130 |       |  6026M  (1)|999:59:59 |       |       |        |      |            |
    |  28 |      VIEW                               |               |    10 |   130 |       |     2   (0)| 00:00:01 |       |       |        |      |            |
    |  29 |       TABLE ACCESS FULL                 | SYS_TEMP_0FD9 |    10 |   130 |       |     2   (0)| 00:00:01 |       |       |        |      |            |
    ----------------------------------------------------------------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       3 - filter(ROWNUM<=10)
       6 - filter(ROWNUM<=10)

I could use "from LARGE_TABLE sample (0.01)" to speed it up, despite the risk of getting a skewed picture. That returned an answer in 53 minutes against a table with 2 billion rows.
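As a minimal sketch of how the sample clause slots in (column A and LARGE_TABLE as above; the resulting frequencies are then only estimates scaled by the sample percentage):

    -- Sketch: count frequencies over roughly 0.01% of the rows.
    select A, count(A) over (partition by A) as A_COUNTS
    from LARGE_TABLE sample (0.01);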

+2

You can't.

There is no trick; it's just raw work.

You have to go through every row of the table, counting the occurrences of each value in each column you're interested in, and then sort those results to find the most frequent ones.

For a single column it's easy:

    SELECT col, count(*)
    FROM the_table
    GROUP BY col
    ORDER BY count(*) DESC

and select the first row.
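(On Oracle 12c or later, a sketch of that last step in SQL itself; older versions need a rownum wrapper like the one in the next answer:)

    -- Sketch, 12c+ only: keep just the most frequent value.
    SELECT col, count(*) AS cnt
    FROM the_table
    GROUP BY col
    ORDER BY count(*) DESC
    FETCH FIRST 1 ROWS ONLY;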

N columns means N table scans.
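For instance, a sketch of the naive combination, reusing the my_table name from the first answer; the optimizer will typically run one full scan of my_table per branch:

    -- Each branch repeats the single-column pattern, so each one
    -- costs its own scan of my_table.
    select 'FirstName' as col, val from (
        select firstname as val
        from my_table
        group by firstname
        order by count(*) desc
    ) where rownum = 1
    union all
    select 'FavoriteAnimal', val from (
        select favoriteanimal as val
        from my_table
        group by favoriteanimal
        order by count(*) desc
    ) where rownum = 1;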

If you write your own logic and go through the table just once, then you have to keep a count for every instance of every value in every column.

If you have 30 billion rows with 30 billion distinct values, you have to store all of them, each with a count of 1. And you'd have that for every column you care about.

If this information is important to you, you'd be better off tracking it separately and incrementally as the data arrives. But that's a different problem.

+1

Assuming you don't have many distinct values in each column, you could do the following:

  • Create a map for each column that stores a counter for each distinct value.
  • Read the entire table (row by row, but only once).
  • For each row, increment the relevant counter in each column's map.
  • At the end, go through the maps and pull out the most frequent values.

For a single column, the SQL to do this would be:

    select value from (
        select value, count(*)
        from the_table
        group by value
        order by count(*) desc
    )
    where rownum < 2

However, if you just combined several of these into one big SQL statement, I think it would scan the table multiple times (once per column), which you do not want. Can you get execution plans for that?

So you will probably have to write a program for this, either on the server (PL/SQL, or Java if that's available) or as a client program.
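As a minimal PL/SQL sketch of the map idea for a single column (assuming the my_table and firstname names from the first answer; everything else here is made up), an associative array plays the role of the map:

    -- Sketch: count first names in one pass with an in-memory map,
    -- tracking the running winner to avoid a final sort.
    declare
        type count_map is table of pls_integer index by varchar2(100);
        fn_counts count_map;
        top_fn    varchar2(100);
        top_cnt   pls_integer := 0;
    begin
        for rec in (select firstname from my_table) loop
            if rec.firstname is not null then  -- null cannot key the map
                if fn_counts.exists(rec.firstname) then
                    fn_counts(rec.firstname) := fn_counts(rec.firstname) + 1;
                else
                    fn_counts(rec.firstname) := 1;
                end if;
                if fn_counts(rec.firstname) > top_cnt then
                    top_cnt := fn_counts(rec.firstname);
                    top_fn  := rec.firstname;
                end if;
            end if;
        end loop;
        dbms_output.put_line('Top first name: ' || top_fn || ' (' || top_cnt || ')');
    end;
    /

The same loop would maintain one map per column of interest; with few distinct values the maps stay small even over billions of rows.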

+1

Loop through your records, keeping in-memory counts of the number of occurrences of each value in each column of interest.

Every so often (every X records, or when the accumulated data reaches some fixed memory limit), loop through those in-memory counts, add them to the corresponding totals in some on-disk store, and clear the in-memory information.

Details depend on which programming language you are using.
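The flush step itself can be plain SQL; a sketch, assuming a hypothetical on-disk totals table name_counts(field_value, cnt), with :name and :cnt bound from each in-memory map entry:

    -- Hypothetical flush: fold one in-memory counter into the
    -- on-disk totals, inserting the value on first sight.
    merge into name_counts t
    using (select :name as field_value, :cnt as cnt from dual) s
    on (t.field_value = s.field_value)
    when matched then
        update set t.cnt = t.cnt + s.cnt
    when not matched then
        insert (field_value, cnt) values (s.field_value, s.cnt);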

0

I've taken a naive approach below. I think it would be completely unworkable for data sets beyond a few hundred thousand rows. Perhaps a guru can use it as the basis for a more suitable answer.

How current do the query results need to be? You could select the results of the query listed below into some kind of cache, perhaps on a nightly basis.

Then you could make the final selection against that.
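In Oracle terms, a sketch of that nightly cache as a materialized view (names and schedule made up here; you would have one such view per column of interest):

    -- Sketch: fully refresh the first-name frequencies every night.
    create materialized view firstname_freq
    refresh complete
    start with sysdate
    next trunc(sysdate) + 1 + 1/24  -- 01:00 tomorrow, then daily
    as
    select firstname, count(*) as freq
    from my_table
    group by firstname;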

Another possibility is to create a trigger on the table in question that updates a counts table on each insert/update/delete (a sketch of such a trigger follows the example table below).

The counts table would look like this:

    field_value  count
    Nancy        2
    Bill         1
    Ferris       1

You would need a counts table like this for each field you want to count.
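A minimal sketch of such a trigger for the FirstName column, with made-up table names (people and firstname_counts), handling inserts only; updates and deletes would need matching decrement branches:

    -- Sketch: keep firstname_counts in step with inserts into people.
    create or replace trigger trg_firstname_count
    after insert on people
    for each row
    begin
        merge into firstname_counts c
        using (select :new.firstname as field_value from dual) s
        on (c.field_value = s.field_value)
        when matched then
            update set c.cnt = c.cnt + 1
        when not matched then
            insert (field_value, cnt) values (s.field_value, 1);
    end;
    /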

In short, I think you need to look at ways of observing this data indirectly. There's no getting around the fact that the actual counting will take a long time. But if you have a way of incrementally keeping track of what has changed, you only have to do the heavy lifting once. After that, your cache plus whatever is new should give you what you need.

    select top 1 firstname, COUNT(*) as freq
    from (
        select 'Ferris' as firstname, 'Freemont' as lastname, 'Possum' as favoriteanimal, 'Ubik' as favoritebook union all
        select 'Nancy', 'Freemont', 'Lemur', 'Housekeeping' union all
        select 'Nancy', 'Drew', 'Penguin', 'Ubik' union all
        select 'Bill', 'Ribbits', 'Lemur', 'Dhalgren'
    ) sample_data
    group by firstname
    order by COUNT(*) desc
0
