18 July 2018
pg_rewind, introduced in PostgreSQL 9.5, is a powerful utility that solves a particular problem: if you have promoted a streaming replication secondary to primary, how can you make sure that the former primary, and any other secondaries that used to be connected to it, are able to connect to the new primary? Previously, there was no entirely safe way of doing so without rebuilding the secondaries using pg_basebackup (for example), which could take a very long time on large databases.
pg_rewind works by connecting to (or having direct file-system-level access to) the new primary and using the WAL information to “back up” the target system (the old primary or an old peer secondary) to the point that it can reattach as a secondary to the new primary.
It works wonderfully… but like anything powerful, it has some warnings associated with it. Here’s one of them.
In a recent situation, a client was doing a flip-back-and-forth stress test, in which a secondary would be promoted, its former primary rewound, the two reattached, and then back again. This worked well for many iterations, but after one particularly fast iteration, the new secondary (former primary, now rewound) wouldn’t come back up; the message was:
requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103
When PostgreSQL exits recovery, it starts a forced checkpoint; you can see this in the logs with a message:
checkpoint starting: force
database system is ready to accept connections
Note that the relevant text here is that the checkpoint is starting; it hasn’t completed yet.
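One way to tell "starting" from "finished" in a script is to watch the log for the matching completion line. The sketch below is only an illustration: it assumes `log_checkpoints` is enabled (so that PostgreSQL logs a `checkpoint complete:` line), that the caller passes only log lines written since the promotion, and the function name is hypothetical.

```python
# Fragments of the log messages to look for. The "checkpoint complete:" line
# is only written when log_checkpoints is enabled (assumption of this sketch).
START = "checkpoint starting:"
DONE = "checkpoint complete:"


def promotion_checkpoint_finished(log_lines):
    """Return True once a completion line follows the forced checkpoint start.

    log_lines is assumed to contain only lines logged since the promotion.
    """
    started = False
    for line in log_lines:
        if not started and START in line and "force" in line:
            started = True
        elif started and DONE in line:
            return True
    return False
```

With only the two lines quoted above in the log, this returns False; it returns True once a later `checkpoint complete:` line shows up.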
The hypothesis (not 100% confirmed, but it seems likely) is that the pg_rewind on the new secondary was done before the checkpoint had finished. Since one of the last things a checkpoint does is write the pg_control file, and one of the first things that pg_rewind does is read the control file from the source machine, there’s a window where the pg_control file on disk will be out of date. The result is a rather confused situation on the new secondary.
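If the hypothesis holds, a stale control file on the source should still report the pre-promotion timeline for its latest checkpoint. The following is a rough sketch of checking that from the text printed by `pg_controldata`; the helper names and the timeline numbers in the usage note are illustrative assumptions, and the check itself relies on the unconfirmed hypothesis above.

```python
def checkpoint_timeline(controldata_text):
    """Extract the latest checkpoint's timeline ID from pg_controldata output."""
    for line in controldata_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Latest checkpoint's TimeLineID":
            return int(value.strip())
    raise ValueError("no \"Latest checkpoint's TimeLineID\" line found")


def control_file_is_current(controldata_text, expected_timeline):
    """True if the control file's last checkpoint is on the expected timeline.

    expected_timeline is supplied by the caller (hypothetical parameter),
    for example the timeline the promoted server reports it is now on.
    """
    return checkpoint_timeline(controldata_text) >= expected_timeline
```

For example, if the promotion moved the source from timeline 7 to timeline 8, a control file that still shows `Latest checkpoint's TimeLineID: 7` would suggest the post-promotion checkpoint has not been written out yet.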
Thus, if you are going to do a pg_rewind, make sure the initial checkpoint on the source machine has completed before doing the rewind. In a scripted environment, this can be tricky, since the initial checkpoint can take any amount of time depending on how much has to be flushed out to disk. One option is to issue a CHECKPOINT statement to the newly promoted primary to make sure a checkpoint has been completed; this does a fast checkpoint, and doesn’t return until the checkpoint is completed. (Edited to recommend
(Thanks to Michael Paquier and Andres Freund for pointing me in the right direction here.)
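For a scripted failover, the CHECKPOINT-then-rewind ordering recommended above might look something like this. It is a minimal sketch, not a tested failover procedure: the function name, the connection string, and the data directory are placeholders, and it assumes `psql` and `pg_rewind` are on the PATH, that the connecting role is allowed to run CHECKPOINT, and that the old primary has already been shut down cleanly.

```python
import subprocess


def checkpoint_then_rewind(source_conninfo, target_pgdata, run=subprocess.run):
    """Force a checkpoint on the new primary, then rewind the old primary.

    source_conninfo and target_pgdata are placeholders supplied by the caller;
    run is injectable so the ordering can be exercised without a real cluster.
    """
    # CHECKPOINT does not return until the checkpoint has completed, so the
    # source's pg_control file should be up to date once psql exits.
    run(
        ["psql", "-X", "-v", "ON_ERROR_STOP=1",
         "-d", source_conninfo, "-c", "CHECKPOINT"],
        check=True,
    )
    # Only now let pg_rewind read the control file from the source.
    run(
        ["pg_rewind",
         "--target-pgdata", target_pgdata,
         "--source-server", source_conninfo],
        check=True,
    )
```

Because `check=True` makes `subprocess.run` raise if `psql` exits with an error, pg_rewind is not attempted when the checkpoint could not be issued.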