Stream with lots of UPDATEs and PostgreSQL

I am new to PostgreSQL optimization and to choosing the right job for it, so please excuse any naivety. I would like to know whether I am trying to use PostgreSQL for work it is poorly suited to, or whether it is suitable and I simply need to configure everything correctly.

Anyway, I need a database with lots of data that change frequently.

For example, imagine an Internet service provider with many clients, each of which has a session (PPP / VPN / whatever) with two self-describing, frequently updated properties: bytes_received and bytes_sent. There is a table for them, where each session is represented by a row with a unique identifier:

    CREATE TABLE sessions(
        id BIGSERIAL NOT NULL,
        username CHARACTER VARYING(32) NOT NULL,
        some_connection_data BYTEA NOT NULL,
        bytes_received BIGINT NOT NULL,
        bytes_sent BIGINT NOT NULL,
        CONSTRAINT sessions_pkey PRIMARY KEY (id)
    )

And as accounting data streams in, this table receives many UPDATEs, such as:

    -- There are *lots* of such queries!
    UPDATE sessions
    SET bytes_received = bytes_received + 53554,
        bytes_sent = bytes_sent + 30676
    WHERE id = 42

When a never-ending stream of quite a few updates (say, 1-2 per second) hits a table with many sessions (say, several thousand), PostgreSQL gets very busy, presumably thanks to MVCC. Is there a way to speed things up, or is Postgres simply not suited to this task, so that I should move these counters to another store, such as memcachedb, and use Postgres only for fairly static data? In that case, though, I would lose the ability to occasionally query this data, for example to find the TOP10 downloaders, which is not great.

Unfortunately, the amount of data cannot be significantly reduced. The ISP accounting example is only meant to simplify the explanation; the real problem concerns a different system whose structure is harder to explain.

Thanks for the suggestions!

+6
optimization sql sql-update postgresql
2 answers

A database is really not the best tool for collecting a large number of small updates, but since I don't know your query and ACID requirements, I can't recommend anything else. If it is an acceptable approach, the application-side aggregation of updates suggested by zzzeek can reduce the update load significantly.

There is a similar approach that gives you durability and the ability to query fresher data at some cost in performance. Create a buffer table that collects the changes to the values that need to be updated, and insert the changes there. At regular intervals, in a transaction, rename the table to something else and create a new table in its place. Then, in a transaction, sum up all the changes, apply the corresponding updates to the main table, and truncate the buffer table. That way, if you need a consistent and fresh snapshot of any data, you can select from the main table and join in all the changes from the active and renamed buffer tables.
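For the sessions example above, a minimal sketch of this buffer-table scheme might look like the following; the table names (sessions_deltas, sessions_deltas_flushing) and the two-step flush are assumptions for illustration, not something from the original question:

    -- Hot path: append-only inserts instead of updates on the main table.
    CREATE TABLE sessions_deltas(
        session_id BIGINT NOT NULL,
        bytes_received BIGINT NOT NULL,
        bytes_sent BIGINT NOT NULL
    );

    INSERT INTO sessions_deltas(session_id, bytes_received, bytes_sent)
    VALUES (42, 53554, 30676);

    -- Periodic flush, step 1: swap the buffer out so new inserts go to a fresh table.
    BEGIN;
    ALTER TABLE sessions_deltas RENAME TO sessions_deltas_flushing;
    CREATE TABLE sessions_deltas(LIKE sessions_deltas_flushing);
    COMMIT;

    -- Periodic flush, step 2: aggregate the renamed buffer into the main table.
    BEGIN;
    UPDATE sessions s
    SET bytes_received = s.bytes_received + d.bytes_received,
        bytes_sent     = s.bytes_sent     + d.bytes_sent
    FROM (SELECT session_id,
                 SUM(bytes_received) AS bytes_received,
                 SUM(bytes_sent)     AS bytes_sent
          FROM sessions_deltas_flushing
          GROUP BY session_id) d
    WHERE s.id = d.session_id;
    DROP TABLE sessions_deltas_flushing;  -- or truncate it, if you prefer to reuse it
    COMMIT;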

However, if neither of these is acceptable, you can also tune the database to handle heavy updates better.

To optimize the updates, make sure that PostgreSQL can use heap-only tuples (HOT) to store the updated row versions. To do this, ensure there are no indexes on the frequently updated columns and lower the fillfactor from its default of 100%. You will need to figure out a suitable fillfactor yourself, as it depends heavily on the details of the workload and the machine it runs on. It needs to be low enough that almost all updates fit on the same database page before autovacuum gets a chance to clean up the old, no-longer-visible versions. You can tune the autovacuum settings to trade database density against vacuuming overhead. Also note that any long transactions, including statistics queries, will hold back cleanup of tuples that changed after the transaction started. Look at the pg_stat_user_tables view to see what needs tuning, in particular the ratio of n_tup_hot_upd to n_tup_upd and of n_live_tup to n_dead_tup; a sketch of the relevant knobs follows below.
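As a rough illustration (the fillfactor of 50 and the per-table autovacuum setting are placeholder values, not recommendations for this particular workload):

    -- Leave free space on each page so updated rows can stay on the same page (HOT);
    -- this only affects newly written pages, so existing data may need a rewrite.
    ALTER TABLE sessions SET (fillfactor = 50);

    -- Optionally make autovacuum visit this table more often than the global default.
    ALTER TABLE sessions SET (autovacuum_vacuum_scale_factor = 0.05);

    -- Watch the HOT-update ratio and the dead-tuple ratio over time.
    SELECT n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'sessions';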

A heavy update load will also generate a lot of write-ahead log (WAL) traffic. Tuning the WAL behavior (see the WAL configuration docs) will help reduce it. In particular, a higher checkpoint_segments (max_wal_size on newer versions) and a higher checkpoint_timeout can significantly lower the I/O load by allowing more of the updates to happen in memory. Look at the ratio of checkpoints_timed vs. checkpoints_req in pg_stat_bgwriter to see how many checkpoints happened because the WAL limit was reached rather than the timeout. Raising shared_buffers so that the working set fits in memory will also help. Check buffers_checkpoint vs. buffers_clean + buffers_backend to see how many buffers were written to satisfy checkpoints versus simply because memory ran out.
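A quick way to check those counters, assuming a PostgreSQL version where they still live in pg_stat_bgwriter (newer releases move the checkpoint counters into pg_stat_checkpointer):

    -- How many checkpoints were forced by WAL volume vs. triggered by the timeout,
    -- and whether buffers were mostly written by checkpoints, the background writer,
    -- or the backends themselves.
    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,
           buffers_clean,
           buffers_backend
    FROM pg_stat_bgwriter;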

+12

You want to collect the stat updates, as they occur, into some kind of in-memory queue or, if you are more ambitious, onto a message bus. A consumer process then aggregates these updates periodically: anywhere from every 5 seconds to once an hour, depending on what you want. The bytes_received and bytes_sent deltas, which may come from many separate "update" messages for the same id, are summed together. In addition, you can batch the update statements for multiple ids into a single transaction, making sure the statements are issued in the same relative order of primary key, to prevent deadlocks against other transactions that may be doing the same thing.

This way you "batch" the activity into larger chunks, to control how much load the PG database sees, and you also serialize many concurrent actions into a single stream (or a few, depending on how many threads/processes issue the updates). The trade-off you tune via the "period" is freshness of the data versus load on the database. A sketch of one such batched flush is below.
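A minimal sketch of a single batched flush, with made-up ids and deltas; the point is that the already-aggregated per-id sums are applied in one transaction, ordered by primary key:

    BEGIN;
    -- Each delta below was summed up in the application from many individual messages.
    UPDATE sessions SET bytes_received = bytes_received + 107224,
                        bytes_sent     = bytes_sent     +  61352 WHERE id = 17;
    UPDATE sessions SET bytes_received = bytes_received +  53554,
                        bytes_sent     = bytes_sent     +  30676 WHERE id = 42;
    UPDATE sessions SET bytes_received = bytes_received +   9001,
                        bytes_sent     = bytes_sent     +   4242 WHERE id = 108;
    COMMIT;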

+6
