How can I speed up queries against huge data warehouse tables with effective-dated data?

So, I am querying some extremely large tables. They are so large because PeopleSoft inserts a new record every time any data changes, rather than updating existing records. In effect, its transactional tables double as a data warehouse.

This requires queries with nested subqueries in order to get the most recent/current row. The rows are effective-dated, and within a single effective date they can also have an effective sequence. So, to get the current record for user_id = 123, I have to do this:

 select *
 from sometable st
 where st.user_id = 123
   and st.effective_date = (select max(sti.effective_date)
                            from sometable sti
                            where sti.user_id = st.user_id)
   and st.effective_sequence = (select max(sti.effective_sequence)
                                from sometable sti
                                where sti.user_id = st.user_id
                                  and sti.effective_date = st.effective_date)

There are a phenomenal number of indexes on these tables already, and I cannot find anything else that would speed up my queries.

My problem is that I often want to get data about individuals from these tables, maybe 50 user_ids, but when I join my own tables, which have only a few records in them, with several of these PeopleSoft tables, performance goes to hell.

PeopleSoft tables are in a remote database, which I access through a database link. My queries tend to look like this:

 select st.*
 from local_table lt, sometable@remotedb st
 where lt.user_id in ('123', '456', '789')
   and lt.user_id = st.user_id
   and st.effective_date = (select max(sti.effective_date)
                            from sometable@remotedb sti
                            where sti.user_id = st.user_id)
   and st.effective_sequence = (select max(sti.effective_sequence)
                                from sometable@remotedb sti
                                where sti.user_id = st.user_id
                                  and sti.effective_date = st.effective_date)

Things get even worse when I have to join multiple PeopleSoft tables with my local table. Performance is simply unacceptable.

What can I do to improve performance? I tried query hints so that my local table would be joined to its PeopleSoft partner first, so Oracle would not try to join all the PeopleSoft tables together before narrowing things down to the right user_ids. I tried the LEADING hint and played with hints intended to push processing to the remote database, but the explain plan was opaque and just said 'REMOTE' for several of the operations, so I had no idea what was going on.
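
A sketch of the kind of hinting I mean, using the same tables as above (LEADING fixes the join order, and DRIVING_SITE is the hint that asks Oracle to execute the query at the remote site):

 select /*+ LEADING(lt) DRIVING_SITE(st) */ st.*
 from local_table lt, sometable@remotedb st
 where lt.user_id in ('123', '456', '789')
   and lt.user_id = st.user_id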

Assuming I have no way to change PeopleSoft or the layout of its tables, what is my best option? If I join my local table with four remote tables, and the local table joins to two of them, how would I write the hints so that my local table (which is very small; in fact, I can just use an inline view so that my local table contains nothing but the user_ids I am interested in) is joined first to each of the remote tables?

EDIT: The application needs real-time data, so unfortunately a materialized view or other way of caching the data will not be enough.

+6
sql oracle data-warehouse peoplesoft
10 answers

Would rewriting your query to something like this help?

 SELECT *
 FROM (SELECT st.*,
              MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
              MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
       FROM local_table lt
       JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
       WHERE lt.user_id in ('123', '456', '789'))
 WHERE effective_date = max_dt
   AND effective_sequence = max_seq;

I agree with @Mark Baker that performance across DB links can really suck, and you will probably be limited in what you can do with this approach.

+4

One approach is to wrap everything in PL/SQL pipelined functions. As an example:

 create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));

 create type typ_remote as object
   (user_id number, eff_date date, eff_seq number, value varchar2(10));
 /

 create type typ_tab_remote as table of typ_remote;
 /

 insert into remote values (1, date '2010-01-02', 1, 'a');
 insert into remote values (1, date '2010-01-02', 2, 'b');
 insert into remote values (1, date '2010-01-02', 3, 'c');
 insert into remote values (1, date '2010-01-03', 1, 'd');
 insert into remote values (1, date '2010-01-03', 2, 'e');
 insert into remote values (1, date '2010-01-03', 3, 'f');
 insert into remote values (2, date '2010-01-02', 1, 'a');
 insert into remote values (2, date '2010-01-02', 2, 'b');
 insert into remote values (2, date '2010-01-03', 1, 'd');

 create function show_remote (i_user_id_1 in number, i_user_id_2 in number)
   return typ_tab_remote pipelined is
   cursor c_1 is
     select user_id, eff_date, eff_seq, value
     from (select user_id, eff_date, eff_seq, value,
                  rank() over (partition by user_id
                               order by eff_date desc, eff_seq desc) rnk
           from remote
           where user_id in (i_user_id_1, i_user_id_2))
     where rnk = 1;
 begin
   for c_rec in c_1 loop
     pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
   end loop;
   return;
 end;
 /

 select * from table(show_remote(1, null));
 select * from table(show_remote(1, 2));

Instead of passing the user_ids individually as parameters, you could load them into a local table (for example, a global temporary table). The PL/SQL would loop through that table, doing a remote select for each row in the local table. No single query would contain both local and remote tables. Effectively, you would be writing your own join code.
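
A minimal sketch of that variant, reusing the types above and assuming a hypothetical global temporary table gtt_user_ids:

 create global temporary table gtt_user_ids (user_id number) on commit preserve rows;

 create function show_remote_gtt return typ_tab_remote pipelined is
 begin
   -- loop over the locally staged user_ids, one remote select per row
   for u in (select user_id from gtt_user_ids) loop
     for c_rec in (select user_id, eff_date, eff_seq, value
                   from (select r.*,
                                rank() over (partition by user_id
                                             order by eff_date desc, eff_seq desc) rnk
                         from remote r
                         where r.user_id = u.user_id)
                   where rnk = 1) loop
       pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
     end loop;
   end loop;
   return;
 end;
 /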

+4

One option is to materialize the remote part of the query first, using a common table expression, so you can make sure that only the required data is retrieved from the remote DB. Another improvement would be to merge the two subqueries against the remote DB into a single analytic-function-based subquery; that rewrite can also be used in your current query. I could only make further suggestions after playing with the actual DB.

See below:

 with remote_query as (
   select /*+ materialize */ st.*
   from sometable@remotedb st
   where st.user_id in ('123', '456', '789')
     and st.rowid in (select first_value(rowid)
                             over (order by effective_date desc, effective_sequence desc)
                      from sometable@remotedb st1
                      where st.user_id = st1.user_id)
 )
 select lt.*, rt.*
 from local_table lt, remote_query rt
 where lt.user_id = rt.user_id
+3

You did not specify your requirements for data freshness, but one option is to create materialized views (you will be limited to REFRESH COMPLETE, since you cannot create snapshot logs in the source system) that hold only the current version of each row. The materialized view tables would reside on your local system, and additional indexing could be added to improve query performance.
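
A sketch of what that could look like, with made-up object names and an assumed nightly complete refresh:

 create materialized view sometable_current_mv
   build immediate
   refresh complete
   start with sysdate next trunc(sysdate) + 1
 as
 select *
 from (select st.*,
              rank() over (partition by st.user_id
                           order by st.effective_date desc,
                                    st.effective_sequence desc) rnk
       from sometable@remotedb st)
 where rnk = 1;

 -- extra local indexing that the source system cannot give you
 create index sometable_current_mv_ix on sometable_current_mv (user_id);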

+1

The performance issue is the access over the database link. Because part of the query involves local tables, everything is executed locally, so there is no access to the remote indexes, and all the remote data is pulled back to be checked locally.

If you could use materialized views in the local database, refreshed from the PeopleSoft database on a periodic (nightly) basis for the historical data, you would only need to hit the remote PeopleSoft database for the current changes (adding effective_date = today to the WHERE clause) and merge the two queries.

Another option would be to use INSERT INTO X SELECT ... FROM on the remote data only, to pull it into a temporary local table or materialized view, and then a second query to join it to your local data (similar to josephj1989's suggestion).
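
A rough sketch of that pull-then-join approach (stage_sometable is a hypothetical local staging table):

 -- pull only the remote rows of interest across the link once
 insert into stage_sometable
 select st.*
 from sometable@remotedb st
 where st.user_id in (select user_id from local_table);

 -- then resolve the current rows and do the join entirely locally
 select *
 from (select s.*,
              rank() over (partition by s.user_id
                           order by s.effective_date desc,
                                    s.effective_sequence desc) rnk
       from stage_sometable s)
 where rnk = 1;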

Alternatively (though there may be licensing issues), try RAC clustering your local DB with the remote PeopleSoft DB.

+1

Instead of using subqueries, you could try this. I do not know how well Oracle will handle it, since I do not use Oracle much.

 SELECT ST1.col1, ST1.col2, ...
 FROM Some_Table ST1
 LEFT OUTER JOIN Some_Table ST2
   ON ST2.user_id = ST1.user_id
  AND (ST2.effective_date > ST1.effective_date
       OR (ST2.effective_date = ST1.effective_date
           AND ST2.effective_sequence > ST1.effective_sequence))
 WHERE ST2.user_id IS NULL

Another possible solution:

 SELECT ST1.col1, ST1.col2, ...
 FROM Some_Table ST1
 WHERE NOT EXISTS
   (SELECT 1
    FROM Some_Table ST2
    WHERE ST2.user_id = ST1.user_id
      AND (ST2.effective_date > ST1.effective_date
           OR (ST2.effective_date = ST1.effective_date
               AND ST2.effective_sequence > ST1.effective_sequence)))
0

Would it be possible to create a copy of the data for the non-warehousing stuff, updated nightly? You could create a nightly process that pulls over only the most recent records. That would save you the MAX() subqueries your daily queries are doing and drastically reduce the number of records.

It does depend on whether you can live with a one-day gap between the most recent data and the data available to you.

I am not very familiar with Oracle, so there may also be a way to get improvements by changing your query...

0

Could you ETL the rows with the desired user_ids into a table of your own, creating only the indexes needed to support your queries, and run your queries against that?

0

Is the PeopleSoft table a delivered one, or is it custom? Are you sure it is a physical table and not a poorly written view on the PS side? If it is a delivered record you are querying against (the example looks a lot like PS_JOB or a view that references it), perhaps you can say which one. PS_JOB is a beast, with many indexes delivered, and most sites add even more.

If you know the indexes on the table, you can use Oracle hints to specify the preferred index to use; that sometimes helps.
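
For example (the index name here is made up; you would use one that actually exists on the table):

 select /*+ INDEX(st ps_sometable_userid_idx) */ *
 from sometable st
 where st.user_id = '123';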

Have you run an explain plan to see if you can determine where the problem is? Maybe there is a Cartesian join, a full table scan, etc.?

0

It seems to me that you are dealing with a type 2 slowly changing dimension, as in a data warehouse. There are several ways to implement a type 2 dimension, mostly by having columns like ValidFrom, ValidTo, Version, and Status. Not all of them are always present; it would be interesting if you could post the schema of your table. Here is an example of how it might look (John Smith moved from Indiana to Ohio on 2010-06-24):

 UserKey  UserBusinessKey  State    ValidFrom   ValidTo     Version  Status
 7234     John_Smith_17    Indiana  2005-03-20  2010-06-23  1        expired
 9116     John_Smith_17    Ohio     2010-06-24  3000-01-01  2        current

To get the latest version of a row, one usually uses:

 WHERE Status = 'current' 

or

 WHERE ValidTo = '3000-01-01' 

Note that this relies on a constant date far in the future.

or

 WHERE ValidTo > CURRENT_DATE 

It seems your example uses only a ValidFrom (effective_date), so you have to take max() to find the latest row. Take a look at your schema: is there an equivalent of Status or ValidTo in your tables?
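
If there is no such column, a ValidTo equivalent can be derived on the fly with LEAD(); a sketch against your table, where the current row is the one whose derived valid_to is null:

 select *
 from (select st.*,
              lead(st.effective_date) over (partition by st.user_id
                                            order by st.effective_date,
                                                     st.effective_sequence) valid_to
       from sometable st)
 where valid_to is null;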

0
