Recently, we had to upgrade a heavily loaded PostgreSQL cluster from version 13 to 16 while keeping downtime minimal. The cluster, consisting of a master and a replica, was handling over 20,000 transactions per second. The process of upgrading PostgreSQL to a newer version also helped resolve several performance and maintenance issues.
In particular, as the system’s load grew, the existing servers became a bottleneck. They had to be replaced with more powerful machines with larger disk volumes to handle the increasing amount of data while maintaining high transaction processing speeds.
On top of that, upgrading to a newer version of PostgreSQL brought new features and enhancements to streamline application development and optimization. Version 13 is also approaching its end-of-life later this year, so the upgrade was essential to ensure we have access to the latest patches and updates.
In this article, I’ll walk you through the upgrade process and show how to overcome the challenges we ran into along the way. In general, the procedure involved two main stages:
- First, we created a new PostgreSQL replica running the desired version via logical replication.
- Next, we transferred the master role to the new replica.
Creating a new replica: Logical replication
Our initial plan was to set up logical replication and test it for the existing database. Here’s what we did to achieve that.
- Migrate the global objects and the schema of the target database:
pg_dumpall --database=postgres --host=a.b.c.d --no-password --globals-only --no-privileges | psql
pg_dump --dbname our_db --host=a.b.c.d --no-password --create --schema-only | psql
- Create a database publication on the master:
CREATE PUBLICATION our_pub FOR ALL TABLES;
- Create a subscription on the replica:
CREATE SUBSCRIPTION our_sub CONNECTION 'host=a.b.c.d dbname=our_db' PUBLICATION our_pub;
- Check that the publication works on the master:
SELECT * FROM pg_catalog.pg_publication;
- Check that the subscription works on the replica:
SELECT * FROM pg_stat_subscription;
By doing so, we got a replication rate of about 1 GB per minute. With a 3.5 TB database, full synchronization would take 2-3 days. But while the database was being copied, the WAL files retained by the replication slot would accumulate on the master. Potentially, the space allotted for them could fill up, causing the master to crash.
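To keep an eye on that, the amount of WAL retained by a slot can be monitored on the master. Here is a minimal sketch of such a check (it relies only on standard catalog columns, nothing specific to our setup):
-- On the master: how much WAL each replication slot is holding back;
-- if this approaches the free space on the WAL volume, the slot becomes dangerous
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;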
Obviously, we didn’t want the master to run out of space, so we stopped the synchronization and came up with another option: run physical replication first and then switch to logical replication.
Physical and logical replication
The draft action plan was as follows:
- Create a physical replica.
- Wait for the physical replica to get to the synchronization point.
- Switch to logical replication.
- Upgrade PostgreSQL to version 16 using pg_upgrade.
Testing showed physical replication rates as high as 10 GB per minute, while the rate of WAL growth was not as high as we had expected.
But one thing still kept us on our toes: how would a logical replica figure out from which LSN it should start working? Fortunately, this one is simple: when configuring logical replication, set the LSN for the slot to use as the starting point. Here are the commands to do just that:
SELECT pg_create_logical_replication_slot('logical_replica_slot', 'pgoutput');
SELECT pg_replication_slot_advance('logical_replica_slot', '0/3402801');
But how do you determine the LSN value to advance the logical replication slot to? Once the replica is promoted to a standalone instance, a line like this appears in the PostgreSQL log: redo done at 0/3402801. It contains the location of the last applied transaction; in this case, the LSN equals 0/3402801. Hence, transactions occurring after this point will be applied to the new instance.
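As a cross-check, the same position can usually be read from the promoted instance itself, since pg_last_wal_replay_lsn() keeps returning the last WAL location applied during recovery once recovery has finished. A minimal sketch:
-- On the freshly promoted replica: the last WAL location applied during recovery,
-- which should match the "redo done at" line in the log
SELECT pg_last_wal_replay_lsn();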
Once we had figured out the process, we moved on to specific actions:
- Create a physical replica.
- Stop it and promote the instance to autonomous mode.
- Switch to logical replication.
- Upgrade PostgreSQL to version 16.
First, we tested the PostgreSQL upgrade from version 13 to version 16, which went well:
postgres:~$ /usr/lib/postgresql/16/bin/pg_upgrade --old-datadir=/var/lib/postgresql/13/main --new-datadir=/var/lib/postgresql/16/main --old-bindir=/usr/lib/postgresql/13/bin --new-bindir=/usr/lib/postgresql/16/bin --old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' --new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' --link
Performing Consistency Checks
Checking cluster versions ok
Checking database user is the install user ok
Checking database connection settings ok
Checking for prepared transactions ok
Checking for system-defined composite types in user tables ok
Checking for reg* data types in user tables ok
Checking for contrib/isn with bigint-passing mismatch ok
Checking for incompatible "aclitem" data type in user tables ok
Checking for user-defined encoding conversions ok
Checking for user-defined postfix operators ok
Checking for incompatible polymorphic functions ok
Creating dump of global objects ok
Creating dump of database schemas ok
Checking for presence of required libraries ok
Checking database user is the install user ok
Checking for prepared transactions ok
Checking for new cluster tablespace directories ok
If pg_upgrade fails after this point, you must re-initdb the new cluster before continuing.
Performing Upgrade
Setting locale and encoding for new cluster ok
Analyzing all rows in the new cluster ok
Freezing all rows in the new cluster ok
Deleting files from new pg_xact ok
Copying old pg_xact to new server ok
Setting oldest XID for new cluster ok
Setting next transaction ID and epoch for new cluster ok
Deleting files from new pg_multixact/offsets ok
Copying old pg_multixact/offsets to new server ok
Deleting files from new pg_multixact/members ok
Copying old pg_multixact/members to new server ok
Setting next multixact ID and offset for new cluster ok
Resetting WAL archives ok
Setting frozenxid and minmxid counters in new cluster ok
Restoring global objects in the new cluster ok
Restoring database schemas in the new cluster ok
Adding ".old" suffix to old global/pg_control ok
If you want to start the old cluster, you will need to remove
the ".old" suffix from /var/lib/postgresql/13/main/global/pg_control.old.
Because "link" mode was used, the old cluster cannot be safely
started once the new cluster has been started.
Linking user relation files ok
Setting next OID for new cluster ok
Sync data directory to disk ok
Creating script to delete old cluster ok
Checking for extension updates notice
Your installation contains extensions that should be updated
with the ALTER EXTENSION command. The file
update_extensions.sql
when executed by psql by the database superuser will update
these extensions.
Upgrade Complete
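For reference, the notice above is resolved by running the generated update_extensions.sql file, which contains statements of roughly the following form (the extension name here is purely illustrative):
-- Illustrative example: bring an installed extension up to the version
-- shipped with PostgreSQL 16
ALTER EXTENSION pg_stat_statements UPDATE;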
Next, we switched a physical replica to a logical one.
On the master, we create the logical replication slot, check it and create a publication for all tables:
\c our_db
SELECT pg_create_logical_replication_slot('logical_replica_slot', 'pgoutput');
SELECT * from pg_replication_slots;
CREATE PUBLICATION our_db_pub FOR ALL TABLES;
Then, we promote the replica to a standalone instance and look for the LSN:
\c postgres
SELECT pg_promote();
cat /var/log/postgresql/postgresql-13-main.log | grep "redo done at"
Back on the master, we advance the logical replication slot to that LSN:
\c our_db
SELECT pg_replication_slot_advance('logical_replica_slot', 'E330A/4903A1');
Finally, we create a subscription on the replica:
\c our_db
CREATE SUBSCRIPTION our_db_sub CONNECTION 'host=a.b.c.d dbname=our_db' PUBLICATION our_db_pub WITH (copy_data=false, slot_name='logical_replica_slot', create_slot=false);
We are ready to upgrade it to version 16!
Upgrading to PostgreSQL 16
Here’s what we do to upgrade the PostgreSQL version:
- Install version 16. We’re using Ubuntu Linux, so it’s:
apt install postgresql-16 postgresql-16-repack
apt purge postgresql-16 postgresql-16-repack postgresql-16-repmgr
- Check whether the installation has been successful, and stop PostgreSQL. At this point, the LSN is saved on the slot; the slot will later resume the publication from this LSN:
pg_lsclusters
systemctl stop postgresql
- Log in as the PostgreSQL user to perform the upgrade. But first, we’ll check if the upgrade is possible:
su postgres
/usr/lib/postgresql/16/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/13/main \
--new-datadir=/var/lib/postgresql/16/main \
--old-bindir=/usr/lib/postgresql/13/bin \
--new-bindir=/usr/lib/postgresql/16/bin \
--old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' \
--new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' \
--check
- Prepare the old instance to avoid errors when creating views and other objects during data migration to the new instance:
pg_dump -s pghero > pghero.sql
DROP EXTENSION pg_stat_statements CASCADE;
DROP SCHEMA pghero CASCADE;
DROP DATABASE pghero;
DROP SCHEMA repmgr CASCADE;
- Perform the actual upgrade by creating hard links to the file inodes, then log out as the PostgreSQL user:
/usr/lib/postgresql/16/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/13/main \
--new-datadir=/var/lib/postgresql/16/main \
--old-bindir=/usr/lib/postgresql/13/bin \
--new-bindir=/usr/lib/postgresql/16/bin \
--old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' \
--new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' \
--link
exit
- Change the ports for the PostgreSQL instances. It might also be helpful to check these PostgreSQL configs for any deprecated options:
sed -i 's/^port = 5433/port = 5432/' /etc/postgresql/16/main/postgresql.conf
sed -i 's/^port = 5432/port = 5433/' /etc/postgresql/13/main/postgresql.conf
- Start the PostgreSQL service:
systemctl start postgresql
- Log in as the PostgreSQL user again, update the optimizer stats (they are not carried over from the old instance), delete the old PgSQL v13 instance, and log out:
su postgres
/usr/lib/postgresql/16/bin/vacuumdb --all --analyze-in-stages
./delete_old_cluster.sh
rm -rf /etc/postgresql/13/main
exit
- Try to disable and then enable the old subscription to the master database. If that doesn’t help, disable it and create a new one. Important: do NOT delete the old subscription, as this will delete the slot on the master:
ALTER SUBSCRIPTION our_db_sub DISABLE;
ALTER SUBSCRIPTION our_db_sub ENABLE;
- Create a new subscription if the previous step failed:
ALTER SUBSCRIPTION our_db_sub DISABLE;
CREATE SUBSCRIPTION our_db_sub_2 CONNECTION 'host=a.b.c.d dbname=our_db' PUBLICATION our_db_pub WITH (copy_data=false, slot_name='logical_replica_slot', create_slot=false);
- Perform the final checks on your PgSQL servers:
-- master
SELECT usename,client_addr,state,replay_lag FROM pg_stat_replication;
SELECT slot_name,slot_type,active FROM pg_replication_slots;
-- replica
SELECT * FROM pg_stat_subscription;
At this point, we encountered a problem: the number of records on the replica differed from the number of records on the master, i.e., the replica lacked some transactions that had occurred during the PostgreSQL upgrade. We suspected they had been lost while we were switching between subscriptions and searching for the right solution.
We tried to restore the subscription, with no success, and ended up creating a new subscription alongside the old one. However, neither that nor simply restarting the old subscription helped: the charts and logs showed that the logical replication worker was running, but no data was actually being transferred.
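In practice, that picture looks roughly like this (a sketch of the check, using only the standard pg_stat_subscription view): a worker pid is present, yet the received positions stay frozen between consecutive runs:
-- On the replica: the apply worker is attached, but received_lsn and
-- latest_end_lsn do not advance between runs of this query
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;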
Most likely, the upgrade also bumped the version of the pgoutput logical decoding plugin, which broke the subscription. That left us with one final option: create a physical replica and combine the switch to the logical replica with the PgSQL version upgrade.
Combining logical replication with version upgrade
The plan was as follows:
- Stop the physical replica.
- Save the LSN.
- Move the logical slot on the master to that LSN.
- Upgrade PgSQL to version 16.
- Create a subscription to the master database on the upgraded PgSQL v16 replica.
We start from the master:
\c our_db
SELECT usename,client_addr,state,replay_lag FROM pg_stat_replication;
SELECT slot_name,slot_type,active FROM pg_replication_slots;
SELECT pg_create_logical_replication_slot('logical_replica_slot', 'pgoutput');
SELECT * FROM pg_replication_slots;
CREATE PUBLICATION our_db_pub FOR ALL TABLES;
Then, we move on to the replica:
\c postgres
SELECT pg_promote();
Here, we will also use the LSN from the PgSQL v13 log and advance the slot to it via pg_replication_slot_advance:
cat /var/log/postgresql/postgresql-13-main.log | grep "redo done at"
We’ll proceed on the master with the given LSN and drop the old streaming replication slot:
\c our_db
SELECT pg_replication_slot_advance('logical_replica_slot', 'F2E12/8D7CC732');
SELECT pg_drop_replication_slot('repmgr_slot_5');
Then, we’ll come back to the replica to install PostgreSQL v16 and check if it was successful:
apt-get install -y postgresql-16 postgresql-16-repack postgresql-16-repmgr postgresql-client-16
pg_lsclusters
We’re all set to prepare the old instance on our replica:
psql -U postgres -p 5432 << EOF
DROP DATABASE pghero;
\connect our_db
DROP EXTENSION pg_repack CASCADE;
DROP EXTENSION pg_stat_statements CASCADE;
DROP SCHEMA pghero CASCADE;
DROP SCHEMA repack CASCADE;
DROP SCHEMA repmgr CASCADE;
\connect postgres
DROP EXTENSION pg_repack CASCADE;
DROP EXTENSION pg_stat_statements CASCADE;
DROP SCHEMA pghero CASCADE;
DROP SCHEMA repack CASCADE;
DROP SCHEMA repmgr CASCADE;
EOF
Now, it’s time to stop PostgreSQL on the replica, log in as the PostgreSQL user and perform the upgrade:
systemctl stop postgresql
su postgres
/usr/lib/postgresql/16/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/13/main \
--new-datadir=/var/lib/postgresql/16/main \
--old-bindir=/usr/lib/postgresql/13/bin \
--new-bindir=/usr/lib/postgresql/16/bin \
--old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' \
--new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' \
--check
/usr/lib/postgresql/16/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/13/main \
--new-datadir=/var/lib/postgresql/16/main \
--old-bindir=/usr/lib/postgresql/13/bin \
--new-bindir=/usr/lib/postgresql/16/bin \
--old-options '-c config_file=/etc/postgresql/13/main/postgresql.conf' \
--new-options '-c config_file=/etc/postgresql/16/main/postgresql.conf' \
--link
exit
On the replica, edit the config files for v16 and v13:
rsync -av /etc/postgresql/13/main/ /etc/postgresql/16/main/
sed -i '/stats_temp_directory/d' /etc/postgresql/16/main/postgresql.conf
sed -i '/vacuum_defer_cleanup_age/d' /etc/postgresql/16/main/postgresql.conf
sed -i 's/pg_stat_statements,pg_repack/pg_stat_statements/' /etc/postgresql/16/main/postgresql.conf
sed -i 's/\/13\//\/16\//' /etc/postgresql/16/main/postgresql.conf
sed -i 's/^port = 5433/port = 5432/' /etc/postgresql/16/main/postgresql.conf
sed -i 's/13-main/16-main/' /etc/postgresql/16/main/postgresql.conf
sed -i 's/13\/main/16\/main/' /etc/postgresql/16/main/postgresql.conf
sed -i 's/^port = 5432/port = 5433/' /etc/postgresql/13/main/postgresql.conf
… and start the PostgreSQL service, followed by the maintenance operations:
systemctl start postgresql
su postgres
/usr/lib/postgresql/16/bin/vacuumdb --all --analyze-in-stages
./delete_old_cluster.sh
rm -rf /etc/postgresql/13/main
exit
Our final touch on the replica is:
\c our_db
CREATE SUBSCRIPTION our_db_sub CONNECTION 'host=a.b.c.d dbname=our_db' PUBLICATION our_db_pub WITH (copy_data=false, slot_name='logical_replica_slot', create_slot=false);
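Before cutting over, it is worth waiting for the new replica to catch up. A minimal sketch of how the remaining lag can be watched on the master, using the slot created earlier:
-- On the master: WAL that the logical slot still has to deliver;
-- a value close to zero means the v16 replica has caught up
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS remaining_lag
FROM pg_replication_slots
WHERE slot_name = 'logical_replica_slot';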
So we set up a streaming replica and waited for it to synchronize with the main database. Next, we created a logical replication slot and a publication for the target database so that data could be transferred. Then we promoted the replica and moved the logical slot to the LSN. In the end, we upgraded the replica, configured PostgreSQL version 16, and created a subscription on the replica.
And this approach worked! Finally, we got a complete workflow for creating a logical replica on a new PostgreSQL instance from streaming replication.
Transferring the master role to the new replica
The next part of the plan was simple: switching the master role over to the updated replica. Here’s what we will do:
- Execute query plans on the new logical replica to warm up the replica’s cache.
- Set the master to a read_only state and await full synchronization with the replica.
- Disable the subscription on the replica to stop replication.
- Configure the load balancers to route traffic to the replica.
- Make sure requests are flowing correctly.
- Delete the subscription.
- Stop the PostgreSQL service on the master.
- Upgrade the old master and get it running as a streaming replica of the new v16 master.
We start switching the master role to the new PostgreSQL instance by performing VACUUM on the replica:
vacuumdb --all --analyze-in-stages
Then, on the master, we execute:
CHECKPOINT;
ALTER SYSTEM SET default_transaction_read_only TO on;
SELECT pg_reload_conf();
SHOW default_transaction_read_only;
… and make sure replay_lag is 0:
SELECT usename,client_addr,state,replay_lag FROM pg_stat_replication;
On the replica:
\c our_db
ALTER SUBSCRIPTION our_db_sub DISABLE;
Now, we need to update the cluster load balancers to route traffic to the new master. Once that is done, we can run a full PostgreSQL check by going through the rollback plan, which involves the following commands:
- Executing on the master:
ALTER SYSTEM SET default_transaction_read_only TO off;
SELECT pg_reload_conf();
- Reverting the cluster load balancer to the old master’s address.
- Executing on the replica:
\c our_db
ALTER SUBSCRIPTION our_db_sub ENABLE;
- OK, let’s move on with the replica:
\c our_db
DROP SUBSCRIPTION our_db_sub;
- Our final actions on the master:
\c our_db
DROP PUBLICATION our_db_pub;
SELECT pg_drop_replication_slot('repmgr_slot_5');
SELECT pg_drop_replication_slot('logical_replica_slot');
exit;
sudo systemctl stop postgresql
The plan played out as intended! The downtime of 10-15 seconds solely affected data writing operations. This brief outage was due to performing a checkpoint and switching traffic from one server to another.
Now, we should note that deleting a subscription also deletes the logical slot on the master. That is why we first disabled the subscription rather than dropping it, so we could roll back if necessary. Next, we had to update the SEQUENCE:
ALTER SEQUENCE id_seq RESTART WITH 1822319;
Why? While most of the tables used a primary key in the UUID v4 format, some relied on an auto-incrementing primary key backed by a SEQUENCE. In those tables, inserting new entries would fail because the SEQUENCE value had not been migrated (logical replication does not replicate sequence data).
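If you do not want to hard-code the restart value, it can be derived from the table itself. A hypothetical example (some_table and id are illustrative names, not part of our schema):
-- Hypothetical example: align the sequence with the current maximum id
-- in the table it serves (replace some_table/id with the real names)
SELECT setval('id_seq', (SELECT COALESCE(MAX(id), 1) FROM some_table));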
Summary
We successfully upgraded our PostgreSQL cluster from version 13 to 16 with minimal downtime while keeping it highly available. While there were some challenges, such as LSN management and subscription issues, we synchronized the data and completed the upgrade without any data loss. The key to our success was combining the transition to a logical replica with the version upgrade.
It is worth noting that upgrading the PostgreSQL cluster was not merely a technical necessity but a strategic move. We took into account the growth in data and transactions that the old servers would soon no longer be able to handle. On top of that, the upgrade unlocked new features in PostgreSQL 16, which will streamline future development.