VACUUM on Redshift (AWS) after DELETE and INSERT

I have a table as shown below (simplified example, we have more than 60 fields):

CREATE TABLE "fact_table" (
    "pk_a" bigint NOT NULL ENCODE lzo,
    "pk_b" bigint NOT NULL ENCODE delta,
    "d_1"  bigint NOT NULL ENCODE runlength,
    "d_2"  bigint NOT NULL ENCODE lzo,
    "d_3"  character varying(255) NOT NULL ENCODE lzo,
    "f_1"  bigint NOT NULL ENCODE bytedict,
    "f_2"  bigint NULL ENCODE delta32k
)
DISTSTYLE KEY
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );

The table is distributed by a high-cardinality dimension.

The table is sorted by a pair of fields that increase over time.

The table contains over 2 billion rows and uses ~350 GB of disk space per node.


Our hourly house-keeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.
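
For illustration, a minimal sketch of such an hourly job, assuming the new and changed rows arrive in a hypothetical staging table called staging_rows (not part of the actual setup) with the same column layout as fact_table:

 BEGIN;

 -- Apply changes to recent rows that already exist in the fact table.
 UPDATE fact_table
 SET    f_1 = s.f_1,
        f_2 = s.f_2
 FROM   staging_rows s
 WHERE  fact_table.pk_a = s.pk_a
 AND    fact_table.pk_b = s.pk_b;

 -- Append the genuinely new rows.
 INSERT INTO fact_table
 SELECT s.*
 FROM   staging_rows s
 LEFT JOIN fact_table f
        ON f.pk_a = s.pk_a
       AND f.pk_b = s.pk_b
 WHERE  f.pk_a IS NULL;

 COMMIT;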

Whichever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes 6 hours

From SELECT * FROM svv_vacuum_progress; we can see that all 2 billion rows are being merged, even though the first 99.9% are completely unaffected.
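
For reference, a minimal progress check against that standard system view, showing the current phase and the estimated time remaining:

 -- Which phase (sort vs. merge) the running VACUUM is in, and its rough ETA.
 SELECT table_name, status, time_remaining_estimate
 FROM svv_vacuum_progress;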


Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all of the records from (1) or (2) up to the end of the table


We tried DELETE and INSERT rather than UPDATE, and now the DML step is much faster. But VACUUM still merges all 2 billion rows.

 DELETE FROM fact_table WHERE pk_a > X;  -- 42 seconds

 INSERT INTO fact_table
 SELECT <blah> FROM <query>
 WHERE pk_a > X
 ORDER BY pk_a, pk_b;  -- 90 seconds

 VACUUM fact_table;  -- 23645 seconds

In fact, VACUUM merges all 2 billion records even if we just trim the last 746 rows off the end of the table.


Question

Does anyone have any advice on how to avoid this enormous VACUUM overhead, and only merge the last 0.1% of the table?

sql amazon-web-services amazon-redshift
2 answers

How often are you VACUUMing the table? How does the long duration affect you? Our load processing continues to run during the VACUUM and we've never experienced any performance problems with it. Basically, it doesn't matter how long it takes, because we just keep running business as usual (BAU).

I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance-sensitive, but we find that query times stay within their normal variation until the table is more than, say, 90% unsorted.
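
As a rough sketch, the unsorted percentage can be tracked with the standard SVV_TABLE_INFO system view:

 -- Total rows and the percentage of unsorted rows, for deciding
 -- whether a VACUUM is actually worth running yet.
 SELECT "table", tbl_rows, unsorted
 FROM svv_table_info
 WHERE "table" = 'fact_table';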

If you're finding a meaningful difference in performance, have you considered using recent and history tables (inside a UNION view if needed)? That way you can VACUUM the small "recent" table quickly.
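
For illustration only, a rough sketch of that split; the table and view names (fact_table_recent, fact_table_history, fact_table_all) are invented for the example:

 -- Small, frequently vacuumed "recent" table plus a large, rarely touched
 -- "history" table, exposed together through a view.
 CREATE TABLE fact_table_recent  (LIKE fact_table);
 CREATE TABLE fact_table_history (LIKE fact_table);

 CREATE OR REPLACE VIEW fact_table_all AS
 SELECT * FROM fact_table_recent
 UNION ALL
 SELECT * FROM fact_table_history;

 -- The hourly job writes only to the small table, so its VACUUM stays cheap.
 VACUUM fact_table_recent;

 -- Periodically roll aged rows into history, then re-sort the recent table.
 INSERT INTO fact_table_history SELECT * FROM fact_table_recent WHERE pk_a <= X;
 DELETE FROM fact_table_recent WHERE pk_a <= X;
 VACUUM fact_table_recent;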


This wouldn't fit in the comments section, so I'm posting it as an answer.

I think right now, if the sort keys are the same across the time-series tables and you have a UNION ALL view over them as your time-series view, and performance is still poor, then you may want a time-series view structure with explicit filters, such as:

 create or replace view schemaname.table_name as
 select * from table_20140901 where sort_key_date = '2014-09-01'
 union all
 select * from table_20140902 where sort_key_date = '2014-09-02'
 union all
 .......
 select * from table_20140925 where sort_key_date = '2014-09-25';

Also, make sure that statistics are collected on the sort keys of all these tables after every load, and try running your queries against the view. Redshift should be able to push any filter values you use down into the view. At the end of the day, after loading, just run VACUUM SORT ONLY or a full VACUUM on the current day's table, which should be much faster.
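
As a minimal sketch of that end-of-load step, reusing the daily table name from the view above:

 -- Collect statistics on the sort key column of the current day's table.
 ANALYZE table_20140925 (sort_key_date);

 -- Re-sort only the small daily table; the older daily tables are untouched.
 VACUUM SORT ONLY table_20140925;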

Let me know if you still run into any problems after trying the above.

