Long UPDATE in PostgreSQL

I am running an UPDATE on a table containing 250 million rows with 3 indexes; the UPDATE joins against another table containing 30 million rows. It has been running for about 36 hours. I wonder if there is a way of knowing how close it is to being done: if it is going to spend a million days doing its job, I will kill it; but if it only needs a day or two, I will let it run. Here is the query:

UPDATE pagelinks
SET    pl_to = page_id
FROM   page
WHERE  (pl_namespace, pl_title) = (page_namespace, page_title)
  AND  page_is_redirect = 0;

The EXPLAIN itself is not the problem here, and I only mention that the table is large and has several indexes to justify somewhat how long the UPDATE may take. But here is the EXPLAIN anyway:

 Merge Join  (cost=127710692.21..135714045.43 rows=452882848 width=57)
   Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
   ->  Sort  (cost=3193335.39..3219544.38 rows=10483593 width=41)
         Sort Key: page.page_namespace, (page.page_title)::text
         ->  Seq Scan on page  (cost=0.00..439678.01 rows=10483593 width=41)
               Filter: (page_is_redirect = 0::numeric)
   ->  Sort  (cost=124517356.82..125285665.74 rows=307323566 width=46)
         Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text
         ->  Seq Scan on pagelinks  (cost=0.00..6169460.66 rows=307323566 width=46)

Now I have also issued, from a parallel connection, a command to DROP one of the pagelinks indexes; of course it is waiting for the UPDATE to finish (but I still wanted to try it!). As a consequence, I cannot even SELECT from pagelinks for fear of data corruption (unless you think it would be safe to kill the backend process running the DROP INDEX?).

So I wonder if there is a table that tracks the number of dead tuples or something similar. It would be nice to know how fast, or how far along, the UPDATE is in its task.
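In case it clarifies what I am after, this is the sort of query I was hoping exists (a sketch only; I am assuming the `pg_stat_user_tables` statistics view and its `n_tup_upd` / `n_dead_tup` counters, and I am not sure how promptly they are refreshed while a transaction is still in flight):

```sql
-- Sketch: per-table counters from the statistics collector.
-- Column names are my assumption and the counters may lag behind
-- (or not reflect) an UPDATE that has not yet committed.
SELECT relname, n_tup_upd, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'pagelinks';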

Thanks. (PostgreSQL is not as smart as I thought; it needs heuristics.)

+6
sql sql-update postgresql
2 answers

Have you read the PostgreSQL documentation on "Using EXPLAIN" to interpret the output you are showing?

I am not a regular PostgreSQL user, but I just read that document and compared it with the EXPLAIN output you are showing. Your UPDATE query does not appear to use any indexes; it performs full table scans and then sorts both page and pagelinks . The sorts are no doubt large enough to require temporary files on disk, which I believe are created under your temp_tablespace .

Then I looked at the estimated database page accesses. The top level of that EXPLAIN output says (cost=127710692.21..135714045.43) . The units here are disk I/O accesses. So it is going to touch the disk upwards of 135 million times to run this UPDATE .

Note that even a 10,000 RPM disk with a 5 ms seek time can deliver at best about 200 I/O operations per second under optimal conditions. That would mean your UPDATE needs roughly 188 hours (7.8 days) of disk I/O (135,714,045 accesses ÷ 200 per second ≈ 678,570 seconds), and that is even if you could keep the disk I/O saturated for that entire period, i.e. continuous reads/writes with no interruption. That is impossible in practice, and I would expect the real throughput to be off by at least an order of magnitude, especially since you have no doubt been using this server for all kinds of other work in the meantime. So my guess is that you are only part of the way through your UPDATE .

If it were me, I would have killed this query on the first day and found another way of performing the UPDATE that made better use of the indexes and did not require an on-disk sort. You probably can't do it in a single SQL statement.
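One possible shape for that, sketched under the assumption that pl_namespace takes only a small number of distinct values (the literal values 0, 1, 2, ... are guesses about the data, not something from the question):

```sql
-- Sketch: run the UPDATE one namespace at a time so each batch is
-- small enough to sort in memory; repeat with the next namespace value.
UPDATE pagelinks
SET    pl_to = page.page_id
FROM   page
WHERE  pagelinks.pl_namespace = page.page_namespace
  AND  pagelinks.pl_title     = page.page_title
  AND  page.page_is_redirect  = 0
  AND  pagelinks.pl_namespace = 0;   -- then 1, 2, ... per batch

-- Between batches, reclaim the dead tuples the UPDATE left behind.
VACUUM pagelinks;
```

Batching also means each statement commits on its own, so a failure or a kill partway through only loses the current batch.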

As for your DROP INDEX , I would guess that it is simply blocked, waiting for exclusive access to the table, and while it is in that state I think you can probably kill it safely.
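From another session, that could look something like this (a sketch; the pg_stat_activity column names shown here, procpid and current_query , differ between PostgreSQL versions, and pg_cancel_backend may not be available on very old releases):

```sql
-- Find the pid of the blocked DROP INDEX backend ...
SELECT procpid, current_query
FROM   pg_stat_activity;

-- ... then cancel just that backend's current statement.
SELECT pg_cancel_backend(12345);   -- 12345: the pid found above (hypothetical)
```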

+4

You need indexes or, as Bill pointed out, it will have to perform sequential scans on all the tables.

 CREATE INDEX page_ns_title_idx ON page(page_namespace, page_title);
 CREATE INDEX pl_ns_title_idx ON pagelinks(pl_namespace, pl_title);
 CREATE INDEX page_redir_idx ON page(page_is_redirect);
0
