How to process huge result sets from a database

I am developing a multi-tier, database-backed web application: a SQL relational database, Java for the middle (service) tier, and a web front end for the user interface. (The language doesn't really matter.)

The middle service tier performs the actual database queries. The user-interface layer simply asks for certain data and has no idea that a database backs it.

The question is: how do I handle large result sets? The UI requests data, but the results can be huge, perhaps too large to fit in memory. For example, a street-sign application might have a service tier like:

 StreetSign getStreetSign(int identifier)
 Collection<StreetSign> getStreetSigns(Street street)
 Collection<StreetSign> getStreetSigns(LatLonBox box)

The user-interface layer requests all street signs matching certain criteria. Depending on the criteria, the result set can be huge. The UI layer might split the results into separate pages (for a browser) or present them all at once (when serving Google Earth, say). A potentially huge result set can be a performance and resource (memory) problem.

One solution is not to return fully loaded objects (the StreetSign objects). Instead, return some result set or iterator that lazily loads each individual object.
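A minimal sketch of that lazy approach, assuming hypothetical batch-fetch plumbing (a real version would wrap a JDBC ResultSet or a DAO; the `fetchBatch` function here is a stand-in for the actual database call):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Rows are fetched in small batches from a backing source only as the
// caller iterates, so the full result set is never held in memory at once.
class LazyResultSet<T> implements Iterable<T> {
    private final BiFunction<Integer, Integer, List<T>> fetchBatch; // (offset, limit) -> rows
    private final int batchSize;

    LazyResultSet(BiFunction<Integer, Integer, List<T>> fetchBatch, int batchSize) {
        this.fetchBatch = fetchBatch;
        this.batchSize = batchSize;
    }

    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private List<T> buffer = new ArrayList<>();
            private int offset = 0;
            private int pos = 0;
            private boolean exhausted = false;

            public boolean hasNext() {
                if (pos < buffer.size()) return true;
                if (exhausted) return false;
                buffer = fetchBatch.apply(offset, batchSize); // load next batch on demand
                offset += buffer.size();
                pos = 0;
                if (buffer.size() < batchSize) exhausted = true;
                return !buffer.isEmpty();
            }

            public T next() {
                hasNext();
                return buffer.get(pos++);
            }
        };
    }
}
```

The caller sees an ordinary `Iterable`, while only one batch of rows lives in memory at a time.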

Another solution is to change the service API so that it returns only a subset of the requested data:

 Collection<StreetSign> getStreetSigns(LatLonBox box, int pageNumber, int resultsPerPage) 

Of course, the user interface can still ask for a huge result set:

 getStreetSigns(box, 1, 1000000000) 
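One way to defuse that call is for the service to clamp the page size to a server-side cap, so no single request can be unbounded. A sketch, where `MAX_PAGE_SIZE` is an assumed policy value, not something from the original API:

```java
// The service never hands the database an unbounded request, whatever the
// caller asks for: getStreetSigns(box, page, clampPageSize(resultsPerPage)).
class StreetSignService {
    static final int MAX_PAGE_SIZE = 200; // assumed policy cap

    static int clampPageSize(int requested) {
        if (requested < 1) return 1;                 // guard nonsense input
        return Math.min(requested, MAX_PAGE_SIZE);   // cap huge requests
    }
}
```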

I am curious what is the standard design pattern for this scenario?

+6
java database web-applications lazy-loading resultset
10 answers

The very first question:

Should (or can) the user manage this amount of data?

Although the result set should be lazily loaded, if its potential size is that huge the answer will probably be "no", so the user interface should not show it all.

I worked on J2EE projects for healthcare systems that handle huge amounts of stored data, literally millions of patients, visits, forms, and so on, and the general rule was never to show more than 100 or 200 rows for any user search; beyond that, tell the user that their criteria match more information than they can take in.

How this is implemented varies from one project to another: you can have the UI ask the service tier for the size of the query before running it, or you can have the service tier throw an exception if the result set grows too large (although that approach couples the service tier to a particular UI constraint).
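The "ask for the size first" variant can be sketched like this; `MAX_DISPLAY_ROWS` and the guard class are illustrative names, and the count itself would come from a hypothetical `SELECT COUNT(*)` DAO call:

```java
// Run a cheap COUNT query before the real one and refuse to materialize
// oversized results, telling the user to refine their criteria instead.
class ResultGuard {
    static final int MAX_DISPLAY_ROWS = 200; // assumed display limit

    static void checkResultSize(long matchingRows) {
        if (matchingRows > MAX_DISPLAY_ROWS) {
            throw new IllegalStateException(
                "Query matches " + matchingRows + " rows; please refine your criteria");
        }
    }
}
```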

Be careful! This does not mean that every service-tier method should throw an exception whenever its result exceeds 100 rows. The rule applies only to result sets shown directly to the user, which is a good reason to put the check in the UI layer rather than in the service tier.

+6

The most common pattern I've seen for this situation is some kind of paging, usually done on the server side to reduce the amount of information sent over the wire.

Here's an example for SQL Server 2000 using a table variable (usually faster than a temp table), applied to your street-sign example:

 CREATE PROCEDURE GetPagedStreetSigns
 (
     @Page int = 1,
     @PageSize int = 10
 )
 AS
 SET NOCOUNT ON

 -- This memory-resident table variable will control paging
 DECLARE @TempTable TABLE (RowNumber int identity, StreetSignId int)

 INSERT INTO @TempTable ( StreetSignId )
 SELECT [Id]
 FROM StreetSign
 ORDER BY [Id]

 -- Select only those rows belonging to the requested page
 SELECT SS.*
 FROM StreetSign SS
 INNER JOIN @TempTable TT ON TT.StreetSignId = SS.[Id]
 WHERE TT.RowNumber BETWEEN ((@Page - 1) * @PageSize + 1)
                        AND (@Page * @PageSize)

In SQL Server 2005 you can get smarter with things like Common Table Expressions and the ranking functions (ROW_NUMBER and friends). But the common theme is that you use the server to return only the information relevant to the current page.
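The row-window arithmetic in that WHERE clause is easy to get off by one when the application builds the query itself; here it is pulled out into Java (names are illustrative):

```java
// Computes the 1-based row window used by the stored procedure above:
// page 1 of size 10 covers rows 1..10, page 3 covers rows 21..30, etc.
class PageBounds {
    static int firstRow(int page, int pageSize) {
        return (page - 1) * pageSize + 1;
    }

    static int lastRow(int page, int pageSize) {
        return page * pageSize;
    }
}
```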

Bear in mind that this approach can get messy if you let the end user apply ad-hoc filters to the data they see.

+2

I would say: if the potential exists for a large data set, then go the paging route.

You can still set a MAX page size that you don't want callers to exceed.

E.g. SO uses page sizes of 15, 30, 50...

+1

One thing to be wary of when working with wrapper classes like the ones you (apparently) have is code that makes extra calls to the database without the developer being aware of it. For example, you might call a method that returns a collection of Person objects and assume that the only thing happening under the hood is a single "SELECT * FROM PERSONS". In reality, the method you call may also walk the returned collection of Person objects and make additional database calls to populate each Person's Orders collection.
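This is the classic "N+1 queries" trap. A toy illustration, with a counter standing in for real database round trips (all names here are hypothetical):

```java
// Fetching N parents and then lazily loading each child collection issues
// 1 + N queries, where an eager JOIN would have issued just 1.
class NPlusOneDemo {
    static int queriesIssued(int persons, boolean lazyChildren) {
        int queries = 0;
        queries++;                     // SELECT * FROM PERSONS
        if (lazyChildren) {
            for (int i = 0; i < persons; i++) {
                queries++;             // SELECT * FROM ORDERS WHERE person_id = ?
            }
        }                              // else: one JOIN already fetched the orders
        return queries;
    }
}
```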

As you say, one of your proposed solutions is not to return fully loaded objects, so you are probably aware of this potential problem. One reason I try to avoid such wrappers is that they invariably make it harder to tune your application and minimize the size and frequency of database traffic.

+1

In ASP.NET I would use server-side paging, where you retrieve only the page of data the user requested from the data store. This is in contrast to retrieving the entire result set, putting it in memory, and paging through it on demand.

0

JSF (JavaServer Faces) has widgets for paging through large result sets in the browser. They can be parameterized as you suggest. I wouldn't call it "the industry-standard design pattern" by any means, but it's worth a look to see how someone else has solved the problem.

0

When I run into this kind of problem, I usually chunk the data sent to the browser (or thin/fat client, whichever fits your situation): regardless of the total amount of data matching some particular criteria, only a small portion is really usable in any UI at one time.

I live in the Microsoft world, so my main environment is ASP.NET with SQL Server. Here are two articles on paging (which cover some methods for paging through result sets) that may be useful:

- Paging through lots of data efficiently (and in an AJAX way) with ASP.NET 2.0
- Efficient data paging with the ASP.NET 2.0 DataList control and ObjectDataSource

Another mechanism Microsoft has shipped recently is their Dynamic Data idea; you could dig into its internals for some guidance on how they deal with this problem.

0

I have done similar things on two different products. In one case the data source is optionally pageable; in Java, it implements a Pageable interface similar to:

 public interface Pageable {
     public void setStartIndex( int index );
     public int getStartIndex();
     public int getRowsPerPage() throws Exception;
     public void setRowsPerPage( int rowsPerPage );
 }

The data source has another method that gets the elements, and a pageable data-source implementation simply returns the current page. That way you can set the start index and grab a page at a time from your controller.
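A sketch of how a controller might drive such a data source; `ListPageable` is an illustrative in-memory implementation of the idea, not code from the answer's actual products:

```java
import java.util.List;

// get() returns only the current page, never the whole collection;
// the controller pages by moving the start index between calls.
class ListPageable<T> {
    private final List<T> all;
    private int startIndex = 0;
    private int rowsPerPage = 10;

    ListPageable(List<T> all) { this.all = all; }

    void setStartIndex(int index) { this.startIndex = index; }
    int getStartIndex() { return startIndex; }
    void setRowsPerPage(int rowsPerPage) { this.rowsPerPage = rowsPerPage; }
    int getRowsPerPage() { return rowsPerPage; }

    List<T> get() {
        if (startIndex >= all.size()) return List.of();
        int end = Math.min(startIndex + rowsPerPage, all.size());
        return all.subList(startIndex, end);
    }
}
```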

One thing to consider is caching your cursors on the server side. For a web application you will have to expire them eventually, but in the meantime they can really help performance.

0

The Fedora digital repository project returns a maximum number of results together with a result identifier. You then get the rest of the results by requesting the next chunk, passing the result identifier in the subsequent query. It works fine as long as you don't want to search or sort outside of the query mechanism.
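The token-based protocol described above can be sketched like this; the names are illustrative and not Fedora's actual API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// First call passes a null token; each response carries the token for the
// next chunk, or null when the result set is exhausted.
class ChunkedSearch {
    private final Map<String, Integer> cursors = new HashMap<>();
    private final List<String> all;
    private final int chunkSize;

    ChunkedSearch(List<String> all, int chunkSize) {
        this.all = all;
        this.chunkSize = chunkSize;
    }

    Chunk next(String token) {
        int offset = (token == null) ? 0 : cursors.remove(token);
        int end = Math.min(offset + chunkSize, all.size());
        String nextToken = null;
        if (end < all.size()) {              // more results remain
            nextToken = "tok-" + end;
            cursors.put(nextToken, end);     // remember where to resume
        }
        return new Chunk(all.subList(offset, end), nextToken);
    }

    record Chunk(List<String> items, String nextToken) { }
}
```

As the answer notes, this resumable-cursor style breaks down once the client wants to re-sort or filter outside the original query.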

0

At the data-retrieval layer, the standard design is to offer both kinds of method: one that returns everything and one that returns a chunk of a given size.

If you want, you can then layer paging components on top of it.

0
