Reducing bloat in tables and indexes with the pg_repack extension
You can use the pg_repack
extension to remove bloat from tables and indexes
as an alternative to VACUUM FULL
. This extension is supported on RDS for PostgreSQL
versions 9.6.3 and higher. For more information on the pg_repack
extension and
the full table repack, see the GitHub project
documentation
Unlike VACUUM FULL
, the pg_repack
extension requires an
exclusive lock (AccessExclusiveLock) only for a short period of time during the table rebuild
operation in the following cases:
-
Initial creation of log table – A log table is created to record changes that occur during initial copy of the data, as shown in the following example:
postgres=>
\dt+ repack.log_* List of relations -[ RECORD 1 ]-+---------- Schema | repack Name | log_16490 Type | table Owner | postgres Persistence | permanent Access method | heap Size | 65 MB Description |
-
Final swap-and-drop phase.
For the rest of the rebuild operation, it only needs an ACCESS SHARE
lock on
the original table to copy rows from it to the new table. This helps the INSERT, UPDATE, and
DELETE operations to proceed as usual.
Recommendations
The following recommendations apply when you remove bloat from the tables and indexes
using the pg_repack
extension:
-
Perform repack during non-business hours or over a maintenance window to minimize its impact on performance of other database activities.
-
Closely monitor blocking sessions during the rebuild activity and ensure that there is no activity on the original table that could potentially block
pg_repack
, specifically during the final swap-and-drop phase when it needs an exclusive lock on the original table. For more information, see Identifying what is blocking a query. When you see a blocking session, you can terminate it using the following command after careful consideration. This helps in the continuation of
pg_repack
to finish the rebuild:SELECT pg_terminate_backend(
pid
); -
While applying the accrued changes from the
pg_repack's
log table on systems with a very high transaction rate, the apply process might not be able to keep up with the rate of changes. In such cases,pg_repack
would not be able to complete the apply process. For more information, see Monitoring the new table during the repack. If indexes are severely bloated, an alternative solution is to perform an index only repack. This also helps VACUUM's index cleanup cycles to finish faster.You can skip the index cleanup phase using manual VACUUM from PostgreSQL version 12, and it is skipped automatically during emergency autovacuum from PostgreSQL version 14. This helps VACUUM complete faster without removing the index bloat and is only meant for emergency situations such as preventing wraparound VACUUM. For more information, see Avoiding bloat in indexes in the Amazon Aurora User Guide.
Pre-requisites
-
The table must have PRIMARY KEY or not-null UNIQUE constraint.
-
The extension version must be the same for both the client and the server.
-
Ensure that the RDS instance has more
FreeStorageSpace
than the total size of the table without the bloat. As an example, consider the total size of the table including TOAST and indexes as 2TB, and total bloat in the table as 1TB. The requiredFreeStorageSpace
must be more than value returned by the following calculation:2TB (Table size)
-1TB (Table bloat)
=1TB
You can use the following query to check the total size of the table and use
pgstattuple
to derive bloat. For more information, see Diagnosing table and index bloat in the Amazon Aurora User GuideSELECT pg_size_pretty(pg_total_relation_size('table_name')) AS total_table_size;
This space is reclaimed after the completion of the activity.
-
Ensure that the RDS instance has enough compute and IO capacity to handle the repack operation. You might consider to scale up the instance class for optimal balance of performance.
To use the pg_repack
extension
-
Install the
pg_repack
extension on your RDS for PostgreSQL DB instance by running the following command.CREATE EXTENSION pg_repack;
-
Run the following commands to grant write access to temporary log tables created by
pg_repack
.ALTER DEFAULT PRIVILEGES IN SCHEMA repack GRANT INSERT ON TABLES TO PUBLIC; ALTER DEFAULT PRIVILEGES IN SCHEMA repack GRANT USAGE, SELECT ON SEQUENCES TO PUBLIC;
-
Connect to the database using the
pg_repack
client utility. Use an account that hasrds_superuser
privileges. As an example, assume thatrds_test
role hasrds_superuser
privileges. The following syntax performspg_repack
for full tables including all the table indexes in thepostgres
database.pg_repack -h
db-instance-name
.111122223333.aws-region
.rds.amazonaws.com -Urds_test
-kpostgres
Note
You must connect using the -k option. The -a option is not supported.
The response from the
pg_repack
client provides information on the tables on the DB instance that are repacked.INFO: repacking table "pgbench_tellers" INFO: repacking table "pgbench_accounts" INFO: repacking table "pgbench_branches"
-
The following syntax repacks a single table
orders
including indexes inpostgres
database.pg_repack -h db-instance-name.111122223333.aws-region.rds.amazonaws.com -U
rds_test
--tableorders
-kpostgres
The following syntax repacks only indexes for
orders
table inpostgres
database.pg_repack -h db-instance-name.111122223333.aws-region.rds.amazonaws.com -U
rds_test
--tableorders
--only-indexes -kpostgres
Monitoring the new table during the repack
-
The size of the database is increased by the total size of the table minus bloat, until swap-and-drop phase of repack. You can monitor the growth rate of the database size, calculate the speed of the repack, and roughly estimate the time it takes to complete initial data transfer.
As an example, consider the total size of the table as 2TB, the size of the database as 4TB, and total bloat in the table as 1TB. The database total size value returned by the calculation at the end of the repack operation is the following:
2TB (Table size)
+4 TB (Database size)
-1TB (Table bloat)
=5TB
You can roughly estimate the speed of the repack operation by sampling the growth rate in bytes between two points in time. If the growth rate is 1GB per minute, it can take 1000 minutes or 16.6 hours approximately to complete the initial table build operation. In addition to the initial table build,
pg_repack
also needs to apply accrued changes. The time it takes depends on the rate of applying ongoing changes plus accrued changes.Note
You can use
pgstattuple
extension to calculate the bloat in the table. For more information, see pgstattuple. -
The number of rows in the
pg_repack's
log table, under the repack schema represents the volume of changes pending to be applied to the new table after the initial load.You can check the
pg_repack's
log table inpg_stat_all_tables
to monitor the changes applied to the new table.pg_stat_all_tables.n_live_tup
indicates the number of records that are pending to be applied to the new table. For more information, see pg_stat_all_tables. postgres=>
SELECT relname,n_live_tup FROM pg_stat_all_tables WHERE schemaname = 'repack' AND relname ILIKE '%log%';
-[ RECORD 1 ]--------- relname | log_16490 n_live_tup | 2000000
-
You can use the
pg_stat_statements
extension to find out the time taken by each step in the repack operation. This is helpful in preparation for applying the same repack operation in a production environment. You may adjust theLIMIT
clause for extending the output further.postgres=>
SELECT SUBSTR(query, 1, 100) query, round((round(total_exec_time::numeric, 6) / 1000 / 60),4) total_exec_time_in_minutes FROM pg_stat_statements WHERE query ILIKE '%repack%' ORDER BY total_exec_time DESC LIMIT 5;
query | total_exec_time_in_minutes -----------------------------------------------------------------------+---------------------------- CREATE UNIQUE INDEX index_16493 ON repack.table_16490 USING btree (a) | 6.8627 INSERT INTO repack.table_16490 SELECT a FROM ONLY public.t1 | 6.4150 SELECT repack.repack_apply($1, $2, $3, $4, $5, $6) | 0.5395 SELECT repack.repack_drop($1, $2) | 0.0004 SELECT repack.repack_swap($1) | 0.0004 (5 rows)
Repacking is completely an out-of-place operation so the original table is not impacted and we do not anticipate any unexpected challenges that require recovery of the original table. If repack fails unexpectedly, you must inspect the cause of the error and resolve it.
After the issue is resolved, drop and recreate the pg_repack
extension in the
database where the table exists, and retry the pg_repack
step. In addition, the
availability of compute resources and concurrent accessibility of the table plays a crucial
role in the timely completion of the repack operation.