How to split a massive data query into multiple queries

I need to select all rows from a table with millions of rows (to preload a Coherence data grid). How do I split this query into multiple queries that can be executed simultaneously by multiple threads?

At first I thought about getting a count of all the rows and doing:

SELECT ... WHERE ROWNUM BETWEEN (packetNo * packetSize) AND ((packetNo + 1) * packetSize) 

but it didn't work. Now I'm stuck.

Any help would be greatly appreciated.

+4
4 answers

If you have an Enterprise Edition license, the easiest way to achieve this is to use parallel query.

For one-off or ad hoc queries, use the PARALLEL hint:

 select /*+ parallel(your_table, 4) */ *
   from your_table
 /

The number in the hint is the number of query slaves you want; in this case the database will start four of them.

If you want every query issued against the table to be parallelized, change the table definition permanently:

 alter table your_table parallel (degree 4)
 /

Note that the database will not always use parallel query; the optimizer decides whether it is appropriate. Parallel query only works with full table scans or index range scans that span multiple partitions.
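
A quick way to see whether the optimizer actually chose a parallel plan is a plain EXPLAIN PLAN (a sketch; substitute your own table name):

 explain plan for
   select /*+ parallel(your_table, 4) */ * from your_table;

 select * from table(dbms_xplan.display);
 -- A parallel plan shows PX COORDINATOR and PX BLOCK ITERATOR operations.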

There are a number of caveats. Parallel query requires enough cores to satisfy the requested number of slaves; with a single dual-core processor, a degree of parallelism of 16 will not magically speed up the query. You also need spare CPU capacity; if the server is already CPU-bound, parallel execution will only make things worse. Finally, the I/O and storage subsystems must be able to satisfy the concurrent demand; a SAN can be a limiting factor here.

As always with performance, it is vital to benchmark against realistic volumes of data in a representative environment before going into production.


What if you don't have an Enterprise Edition license? Well, you can simulate parallel execution manually. Tom Kyte calls it "Do-It-Yourself Parallelism". I have used this technique myself and it works well.

The main idea is to work out the full range of ROWIDs that apply to the table and divide it into several chunks. Unlike some of the other solutions offered in this thread, each task then selects only the rows it needs. Mr Kyte summarized the technique in an old AskTom thread, including the vital splitting script: find it here.
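
Once the ranges are worked out, each worker thread runs a query shaped roughly like this (a sketch; the bind variable names are illustrative):

 select *
   from your_table
  where rowid between :lo_rid and :hi_rid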

Splitting the table and launching the threads is a manual task: fine as a one-off, but rather tedious if you have to do it often. So, if you are on 11g Release 2, you should know that there is a new PL/SQL package, DBMS_PARALLEL_EXECUTE, which automates this for us.
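
A minimal sketch of how DBMS_PARALLEL_EXECUTE could generate the chunks for external worker threads; the task name, table name and chunk size here are illustrative, not taken from the original answer:

 begin
   dbms_parallel_execute.create_task(task_name => 'preload_task');

   -- Carve the table into ROWID chunks of roughly 10,000 rows each.
   dbms_parallel_execute.create_chunks_by_rowid(
       task_name   => 'preload_task',
       table_owner => user,
       table_name  => 'YOUR_TABLE',
       by_row      => true,
       chunk_size  => 10000);
 end;
 /

 -- Each worker thread then claims a chunk from this view and runs the
 -- ROWID-range query shown above against its own start_rowid/end_rowid.
 select chunk_id, start_rowid, end_rowid
   from user_parallel_execute_chunks
  where task_name = 'preload_task'
  order by chunk_id;

The package can also execute a chunked SQL statement itself via DBMS_PARALLEL_EXECUTE.RUN_TASK, but for feeding an external data grid it is the client threads that need to do the reads, so handing them the chunk boundaries is enough.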

+5

Are you sure that parallel query execution will be faster? It will only be faster if the huge table is stored on a disk array with many spindles, or if it is spread over several disks. In all other cases, sequential access to the table will be many times faster.

If you really need to split the query, you must split it in such a way that sequential access is still possible for each part. Post the table DDL so that we can give a specific answer.

If the data processing or the loading into the data grid is the bottleneck, then you are better off reading the data with a single process and splitting it up before further processing.

Assuming the reading is fast and the further processing of the data is the bottleneck, you could decouple the two by writing the data out to very simple text files (fixed-width or CSV, for example). After every 10,000 rows you start a new file and spawn a thread or process to work on the file you just finished.
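
A rough PL/SQL sketch of that single-reader, file-splitting idea; the directory object DATA_DIR, the table YOUR_TABLE and its ID and PAYLOAD columns are illustrative assumptions:

 declare
   l_file    utl_file.file_type;
   l_rows    pls_integer := 0;
   l_file_no pls_integer := 0;
 begin
   l_file := utl_file.fopen('DATA_DIR', 'chunk_0.csv', 'w', 32767);
   for r in (select id, payload from your_table) loop
     utl_file.put_line(l_file, r.id || ',' || r.payload);
     l_rows := l_rows + 1;
     if mod(l_rows, 10000) = 0 then
       utl_file.fclose(l_file);  -- hand the finished file to a worker thread
       l_file_no := l_file_no + 1;
       l_file := utl_file.fopen('DATA_DIR', 'chunk_' || l_file_no || '.csv', 'w', 32767);
     end if;
   end loop;
   utl_file.fclose(l_file);
 end;
 /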

+3

Try something like this:

 select *
   from ( select a.*, ROWNUM rnum
            from ( <your_query_goes_here, with order by> ) a
           where ROWNUM <= :MAX_ROW_TO_FETCH )
  where rnum >= :MIN_ROW_TO_FETCH;
0

Have you considered using MOD 10 on ROWNUM to pull data a tenth at a time?

 -- ROWNUM must be materialized in an inline view before it can be filtered on
 SELECT A.* FROM ( SELECT t.*, ROWNUM rnum FROM Table t ) A WHERE MOD(A.rnum, 10) = 0;
-1
