Correct use of cursors for very large result sets in Postgres

Short version of my question:

If I hold a cursor reference to an astronomically huge result set in my client code, would it be ridiculous (that is, would it completely defeat the point of the cursor) to issue "FETCH ALL FROM cursorname" as my next command? Or would the data be streamed back to me gradually as I consume it (at least in principle, assuming a well-written driver sits between me and Postgres)?

More details

If I understand things correctly, Postgres cursors REALLY exist to solve the following problem [even if they can be used (abused?) for other things, such as returning several different result sets from one function]:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated.

(ref: https://www.postgresql.org/docs/9.6/static/plpgsql-control-structures.html )

But (again, if I understand correctly), when you write a function that returns a cursor, the entire result set is NOT buffered in memory (or on disk) before the caller of the function can start consuming anything; instead, the results can be consumed bit by bit. (There is more overhead in declaring and using a cursor, but it is worth it to avoid a massive buffer allocation for very large result sets.)

(ref: https://www.postgresql.org/docs/9.6/static/plpgsql-cursors.html#AEN66551 )
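To make the contrast concrete, here is a minimal sketch of the two styles those documentation pages describe. The function names, and the assumption that AstronomicallyLargeTable has an integer id column, are mine, purely for illustration:

    -- Style 1: RETURN QUERY - the whole result set is materialized inside
    -- the function before anything is returned to the caller.
    CREATE OR REPLACE FUNCTION all_rows_query() RETURNS TABLE (id int)
    LANGUAGE plpgsql AS
    $$
    BEGIN
        RETURN QUERY SELECT t.id FROM AstronomicallyLargeTable t;
    END;
    $$;

    -- Style 2: return a refcursor - the caller FETCHes rows from the named
    -- portal in parts, within the same transaction.
    CREATE OR REPLACE FUNCTION all_rows_cursor() RETURNS refcursor
    LANGUAGE plpgsql AS
    $$
    DECLARE
        c refcursor := 'big_cursor';   -- the portal name the caller will FETCH from
    BEGIN
        OPEN c FOR SELECT t.id FROM AstronomicallyLargeTable t;
        RETURN c;
    END;
    $$;

The question below is essentially about how that second style behaves when the caller then says FETCH ALL rather than fetching in batches.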

I would like to understand how this applies to SELECTs and FETCHes sent over the wire to a Postgres server.

In all cases, I'm talking about consuming results from client code that communicates with Postgres on a socket behind the scenes (actually, using the Npgsql library in my case).

Q1: What happens if I execute "SELECT * FROM AstronomicallyLargeTable" as my only command over the wire to Postgres? Does the server allocate all the memory for the whole result set and only then start sending data to me? Or does it (effectively) create its own cursor and stream the data back bit by bit (without a huge extra buffer allocation on the server)?

Q2: What if I already have a cursor reference to an astronomically large result set (say, because I have already made one round trip and got the cursor reference back from some function), and I then execute "FETCH ALL FROM cursorname" over the wire to Postgres? Is that stupid, because the server will allocate ALL the memory for ALL the results before sending anything back to me? Or does "FETCH ALL FROM cursorname" actually work the way I would like, streaming the data back gradually as I consume it, without any massive buffer allocation on the Postgres server?

EDIT: Further Clarification

I am asking about the case where I know that my data access layer streams data from the server to me one row at a time (so there are no large client-side buffers, no matter how long the stream of data is), and where I also know that my own application consumes the data one row at a time and then discards it (so no client-side buffering there either). I definitely do NOT want to fetch all of those rows into client-side memory and then do something with them. I can see that that would be completely stupid!

So I think the whole question (for the use case just described) is how quickly PostgreSQL starts streaming rows and how much buffer memory it allocates for a FETCH ALL. IF (and it's a big "IF"...) PostgreSQL does not allocate a huge buffer of all the rows before it starts, and if it streams the rows back to Npgsql one at a time, starting promptly, then I believe (but please tell me why/if I'm wrong) that there is still a clear use case for FETCH ALL FROM cursorname!

postgresql cursor
3 answers

After some experimentation, PostgreSQL seems to behave as follows:

  • Fetching many rows with SELECT * FROM large will not create a temporary file on the server side; the data are streamed to the client as they are scanned.

  • If you create a server-side cursor via a function that returns a refcursor and then fetch the rows from that cursor, all of the returned rows are first collected on the server. This leads to the creation of a temporary file if you run FETCH ALL .

Here are my experiments with a table containing 1,000,000 rows. work_mem is set to 64kB (the minimum). log_temp_files is set to 0, so all temporary files are reported in the server log.
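For reference, a setup along these lines should reproduce the experiment; the exact table definition is my assumption, since the answer only states that the table holds 1,000,000 rows:

    -- assumed test table: 1,000,000 integer ids (the original definition isn't shown)
    CREATE TABLE large AS
        SELECT generate_series(1, 1000000) AS id;

    SET work_mem = '64kB';        -- the minimum value, so buffered data spills to disk early
    SET log_temp_files = 0;       -- log every temporary file, whatever its size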

  • First try:

     SELECT id FROM large; 

    Result: a temporary file is not created.

  • Second attempt:

     CREATE OR REPLACE FUNCTION lump() RETURNS refcursor
        LANGUAGE plpgsql AS
     $$DECLARE
        c CURSOR FOR SELECT id FROM large;
     BEGIN
        c := 'c';
        OPEN c;
        RETURN c;
     END;$$;

     BEGIN;

     SELECT lump();
      lump
     ------
      c
     (1 row)

     FETCH NEXT FROM c;
      id
     ----
       1
     (1 row)

     FETCH NEXT FROM c;
      id
     ----
       2
     (1 row)

     COMMIT;

    Result: a temporary file is not created.

  • Third attempt:

     BEGIN;

     SELECT lump();
      lump
     ------
      c
     (1 row)

     FETCH all FROM c;
        id
     ---------
           1
           2
           3
     ...
      999999
     1000000
     (1000000 rows)

     COMMIT;

    Result: a temporary file of about 140 MB in size is created.

I really don't know why PostgreSQL behaves this way.


One thing missing from your question is whether you really need a PL/pgSQL function at all, rather than an inline SQL function. I only mention this because your description boils down to a simple query - select * from hugetable - so I am going to answer on that basis.

In that case your problem is not really a problem, because the function call can be inlined. What I mean is: if you can write the function as an inlinable SQL function (you don't say one way or the other), then you don't need to worry about this particular limitation of plpgsql's RETURN QUERY at all.

 CREATE OR REPLACE FUNCTION foo() RETURNS TABLE (id INT) AS
 $BODY$
     SELECT * FROM bar;
 $BODY$ LANGUAGE SQL STABLE;

Look at the plan:

 EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo() LIMIT 1;

                                                   QUERY PLAN
 -------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=1)
    Buffers: shared hit=1
    ->  Seq Scan on bar  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.014..0.014 rows=1 loops=1)
          Buffers: shared hit=1
  Planning time: 0.082 ms
  Execution time: 0.031 ms
 (6 rows)

There is no complete result set that gets materialized and then returned.

https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions

I will defer to the other answers here if you really do need plpgsql to do something plain SQL cannot, but this point needed to be made.


When you need to process an astronomically large data set and you use SELECT * FROM or RETURN QUERY , you need an astronomically large buffer not only on the server but also on the client, and then you have to wait an astronomically long time for all of it to cross the network. No cursor is used internally in that case.

With a CURSOR you can avoid that buffering, but FETCH ALL would be just plain stupid, because it forces the cursor to abandon the very thing it exists for: delivering data from the database in parts. On the server side you may avoid the buffering, because data can be sent over the network as it is produced, but the client side will still have to buffer all of the data.
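To be concrete, the pattern a cursor is meant for looks roughly like this (a sketch only; the table name and batch size are illustrative, not taken from the question):

    BEGIN;
    DECLARE c CURSOR FOR SELECT id FROM AstronomicallyLargeTable;
    FETCH FORWARD 1000 FROM c;   -- process this batch on the client...
    FETCH FORWARD 1000 FROM c;   -- ...and repeat until a batch comes back empty
    CLOSE c;
    COMMIT;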

Some frameworks (like Hibernate) do this kind of batching behind the scenes, but I don't know of similar functionality in lower-level libraries like Npgsql or the JDBC driver. That batching also comes at a price, in particular an astronomically large number of queries like SELECT * FROM table LIMIT 1000 OFFSET 23950378000 or something of that sort.

In any case, if you really have that much data to process, you are much better off doing the processing server-side, e.g. in a PL/pgSQL function, and sending only the results to the client. Not only is the server machine usually better provisioned than the client, you also avoid most of the network overhead.
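A minimal sketch of what that could look like, assuming the goal is to boil the huge table down to a small summary (the function, table and column names are made up for illustration):

    -- Hypothetical example: walk the big table row by row on the server
    -- (the FOR loop uses a cursor internally) and return only an aggregate.
    CREATE OR REPLACE FUNCTION summarize_large() RETURNS bigint
    LANGUAGE plpgsql STABLE AS
    $$
    DECLARE
        r     record;
        total bigint := 0;
    BEGIN
        FOR r IN SELECT id FROM AstronomicallyLargeTable LOOP
            total := total + r.id;   -- stand-in for whatever per-row work you need
        END LOOP;
        RETURN total;                -- only this single value crosses the wire
    END;
    $$;

    SELECT summarize_large();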

