How to handle pagination with frequent changes to the source data

I'm using Elasticsearch in particular, but this question could apply to any database.

Elasticsearch pages search results with convenient from and size parameters.

So I run a query: get me the most recent data, results 1 to 10.

This works great.

The user clicks "next page", and the query becomes: get me the most recent data, results 11 to 20.
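
As a minimal sketch of those two requests against the REST API (the node URL, index name events, and timestamp field created_at are all illustrative assumptions, not from the original question):

```python
import requests

ES = "http://localhost:9200"  # assumed local node; adjust to your cluster
INDEX = "events"              # illustrative index name

def fetch_page(page, size=10):
    # Elasticsearch's "from" is a zero-based offset, so page 0 is
    # results 1-10, page 1 is results 11-20, and so on.
    body = {
        "from": page * size,
        "size": size,
        "sort": [{"created_at": "desc"}],  # illustrative timestamp field
        "query": {"match_all": {}},
    }
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]

page1 = fetch_page(0)  # "results 1 to 10"
page2 = fetch_page(1)  # "results 11 to 20" - may overlap page1 if new docs arrived
```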

The problem is that in the time between the two queries, 2 new records were added to the backing database, so the paginated results overlap: the last 2 results of the first page appear again as the first 2 results of the second page.

What is the best way to avoid this? Right now I'm adding a filter to the query that only returns results older than the last result of the previous query. But it just seems hacky.
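
For reference, a minimal sketch of that workaround, reusing the illustrative names from the sketch above and assuming created_at is stored in _source: keep the created_at value of the last hit on the previous page and ask only for older documents.

```python
def fetch_next_page(last_seen_created_at, size=10):
    # Exclude everything at or after the last document already shown,
    # so records inserted between requests cannot shift the window.
    # (In practice, ties on the timestamp need an extra tiebreaker field.)
    body = {
        "size": size,
        "sort": [{"created_at": "desc"}],
        "query": {"range": {"created_at": {"lt": last_seen_created_at}}},
    }
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]

page1 = fetch_page(0)
page2 = fetch_next_page(page1[-1]["_source"]["created_at"])
```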

+5
2 answers

A filter is not a bad option if you are already indexing a suitable timestamp. You do have to track that timestamp client-side to build your requests correctly, and you need to know when to discard it, but these are not insurmountable problems.
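
One concrete shape for that, as a sketch under the same assumed index and field names as above: record a snapshot timestamp when the user opens page 1, pin every page to documents created at or before it (which keeps the from/size offsets stable), and discard it when the session ends.

```python
from datetime import datetime, timezone

import requests

ES = "http://localhost:9200"  # assumed node, as in the sketches above
INDEX = "events"

def start_pagination_session():
    # Client-side snapshot instant; keep it for the life of the
    # pagination session and throw it away on a fresh search.
    return datetime.now(timezone.utc).isoformat()

def fetch_page_at(snapshot_ts, page, size=10):
    # Only documents that already existed at the snapshot instant,
    # so later inserts cannot shift the offsets between pages.
    body = {
        "from": page * size,
        "size": size,
        "sort": [{"created_at": "desc"}],
        "query": {"range": {"created_at": {"lte": snapshot_ts}}},
    }
    resp = requests.post(f"{ES}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]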

The scroll API is a solid option for this, as it effectively takes a point-in-time snapshot on the Elasticsearch side. The purpose of the scroll API is to provide a stable view of the results for deep pagination, which addresses exactly the kind of change problem you are experiencing.

You start a scrolling search by sending your query along with a scroll parameter, and Elasticsearch returns a scroll_id. You then send requests to /_search/scroll with that identifier, each of which returns a page of results and a new scroll_id for the next request.
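
A minimal sketch of that flow against the REST API (same illustrative index and field names as above; the JSON scroll request body assumes Elasticsearch 2.x or later, and the 1m keep-alive is an arbitrary choice):

```python
import requests

ES = "http://localhost:9200"  # assumed local node
INDEX = "events"              # illustrative index name

# Open the scroll: a normal search plus a ?scroll=1m keep-alive window.
body = {
    "size": 10,
    "sort": [{"created_at": "desc"}],  # illustrative timestamp field
    "query": {"match_all": {}},
}
resp = requests.post(f"{ES}/{INDEX}/_search", params={"scroll": "1m"}, json=body)
resp.raise_for_status()
data = resp.json()
scroll_id = data["_scroll_id"]
hits = data["hits"]["hits"]  # first page of results

while hits:
    # ... render `hits` to the user, then fetch the next page ...
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    )
    resp.raise_for_status()
    data = resp.json()
    scroll_id = data["_scroll_id"]  # may change between requests
    hits = data["hits"]["hits"]     # empty once the snapshot is exhausted
```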

(Note that the scan search type is not needed here. It is meant for pulling documents out in bulk and does not apply any sorting.)

Compared to filtering, you still have to keep track of one value between requests: the scroll_id for the next page of results. Whether that is easier than tracking a timestamp depends on your application.

There are other potential disadvantages. Elasticsearch maintains the context for a scrolling search on a single node within the cluster, and these contexts can accumulate, depending on how heavily you rely on scrolling searches, so you will want to check the impact on performance. And if I remember correctly, scrolling searches also do not survive a node failure or restart.
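
Because of that, when the user abandons pagination early it is worth freeing the context explicitly rather than waiting for the keep-alive to lapse. Continuing the scroll sketch above (same ES constant and scroll_id):

```python
# Release the server-side search context as soon as it is no longer needed.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": [scroll_id]})
```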

The ES documentation for the scroll API contains details on all of the above.

Bottom line: filtering by timestamp is actually not a bad choice. The scroll API is another valid option designed for a similar use case, but not without its drawbacks.

+5

For this you need to use the scan API. Scan and scroll together let you search against a fixed point in time and paginate through the results.
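
For completeness, a sketch of what that looked like (the scan search type was deprecated in Elasticsearch 2.1 and later removed; as the answer above notes, it returns hits unsorted, so it does not fit "most recent first" pagination; index and node names are again illustrative):

```python
import requests

ES = "http://localhost:9200"
INDEX = "events"

# Legacy scan: the initial request returns no hits, only a scroll_id;
# documents arrive (unsorted) from the subsequent /_search/scroll calls.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"search_type": "scan", "scroll": "1m"},
    json={"size": 10, "query": {"match_all": {}}},  # size here is per shard
)
resp.raise_for_status()
scroll_id = resp.json()["_scroll_id"]
```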

-1
