The Build


Three Steps to pg_rewind Happiness

9 August 2018

pg_rewind is a utility included with PostgreSQL since 9.x. It’s used to “rewind” a server so that it can be attached as a secondary to a primary. The server being rewound could be the former primary of the new primary, or a secondary that was a peer of the new primary.

In pg_rewind terminology, and in this post, the “source” server is the new primary that the old server is going to be attached to, and the “target” is the server that will be attached to the source as a secondary.

Step One: Have a WAL Archive

While pg_rewind does not require that you have a WAL archive, you should have one. pg_rewind works by “backing up” the target server to a state before the last shared checkpoint of the two servers. Then, when the target starts up, it uses WAL information to replay itself to the appropriate point at which it can connect as a streaming replica to the source. To do that, it needs the WAL information from the rewind point onwards. Since the source had no reason to “know” that it would be used as a primary, it may not have enough WAL information in its pgxlog / pgwal directory to bring the target up to date. If it doesn’t, you are back to rebuilding the new secondary, the exact situation that pg_rewind is meant to avoid.

Thus, make sure you have a WAL archive that the target can consult as it is coming up.

Step Two: Properly Promote the Source Server

The source server, which will be the new primary, needs to be properly promoted. Use the pg_ctl promote option, or the trigger_file option in recovery.conf so that the source promotes itself, and starts a new timeline. Don’t just shut the source down, remove recovery.conf, and bring it back up! That doesn’t create a new timeline, and the source won’t have the appropriate divergence point from the target for pg_rewind to consult.

Step Three: Wait for the Forced Checkpoint to Complete

When a secondary is promoted to being a primary, it starts a forced checkpoint when it exits recovery mode. This checkpoint is a “fast” checkpoint, but it can still take a while, depending on how big shared_buffers is and how many buffers are dirty. Use tail -f to monitor the logs on the source and wait for that forced checkpoint to complete before running pg_rewind to rewind the target. Failing to do this can cause the target to be corrupted. If you are writing a script to do this, issue a CHECKPOINT statement to the source before running pg_rewind.

And have fun rewinding servers!