I have a table as shown below (simplified example, we have more than 60 fields):
CREATE TABLE "fact_table" ( "pk_a" bigint NOT NULL ENCODE lzo, "pk_b" bigint NOT NULL ENCODE delta, "d_1" bigint NOT NULL ENCODE runlength, "d_2" bigint NOT NULL ENCODE lzo, "d_3" character varying(255) NOT NULL ENCODE lzo, "f_1" bigint NOT NULL ENCODE bytedict, "f_2" bigint NULL ENCODE delta32k ) DISTSTYLE KEY DISTKEY ( d_1 ) SORTKEY ( pk_a, pk_b );
The table is distributed on a high-cardinality dimension.
The table is sorted by a pair of fields that increase over time.
The table contains over 2 billion rows and uses ~350 GB of disk space per node.
Our hourly housekeeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100 thousand rows.
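For reference, a query along these lines against Redshift's SVV_TABLE_INFO system view is one way to confirm the table's size, row count, and how much of it sits in the unsorted region between vacuums (the table name is the simplified one from above):

-- size is reported in 1 MB blocks; unsorted is the percentage of rows
-- in the unsorted region since the last vacuum
SELECT "table", size, tbl_rows, unsorted
FROM svv_table_info
WHERE "table" = 'fact_table';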
Whatever mechanism we choose, VACUUMING the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes 6 hours
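(The vacuum in question is assumed to be a plain full vacuum of the table, which is Redshift's default behaviour, i.e. something like the following.)

-- default VACUUM is equivalent to VACUUM FULL: it re-sorts rows and
-- reclaims space from deleted rows, which triggers the merge phase
VACUUM FULL fact_table;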
From SELECT * FROM svv_vacuum_progress; it is clear that all 2 billion rows are being merged, even though the first 99.9% of the table is completely unaffected.
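A related query against SVV_VACUUM_SUMMARY (column names per the Redshift system-view documentation) shows, per vacuum run, how many merge increments were needed and how many rows changed position, which is another way to confirm that the whole table is being rewritten:

-- one row per completed vacuum; row_delta / sortedrow_delta show how
-- many rows were added, removed, or re-sorted by that run
SELECT table_name, sort_partitions, merge_increments, elapsed_time,
       row_delta, sortedrow_delta
FROM svv_vacuum_summary
WHERE table_name = 'fact_table'
ORDER BY xid DESC;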
Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all records from (1) or (2) to the end of the table
We have tried DELETE and INSERT rather than UPDATE, and that DML step is now much faster. But VACUUM still merges all 2 billion rows.
DELETE FROM fact_table WHERE pk_a > X;
In fact, VACUUM merges all 2 billion rows even if we just trim the last 746 rows off the end of the table.
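For completeness, a minimal sketch of the DELETE-and-INSERT approach described above, assuming a hypothetical staging table fact_table_staging that holds the hour's refreshed and new rows (the staging table and join columns are illustrative, not part of the original schema):

-- drop the recent rows that are about to be replaced, matching on the
-- composite sort key; fact_table_staging is a hypothetical staging table
DELETE FROM fact_table
USING fact_table_staging s
WHERE fact_table.pk_a = s.pk_a
  AND fact_table.pk_b = s.pk_b;

-- re-insert the refreshed rows together with the ~100k brand-new rows
INSERT INTO fact_table
SELECT * FROM fact_table_staging;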
Question
Does anyone have any tips on how to avoid this huge VACUUM overhead and only merge the last 0.1% of the table?
sql amazon-web-services amazon-redshift
MatBailie