The Build


pg_rewind and checkpoints: caution!

18 July 2018

pg_rewind, introduced in PostgreSQL 9.5, is a powerful utility solving a particular problem: If you have promoted a streaming replication secondary into being a primary, how can you make sure that the former primary, and any other secondaries that used to be connected to it, are able to connect to the new primary? Previously, there was no entirely safe way of doing so without simply rebuilding the secondaries using pg_basebackup (for example), which could take a very long time on large databases.

pg_rewind works by connecting to (or having direct file-system level access to) the new primary, and using the WAL information to “back up” the target system (the old primary or old peer secondaries) to the point that it can reattach as a secondary to the new primary.

It works wonderfully… but like anything powerful, it has some warnings associated with it. Here’s one of them.

In a recent situation, a client was doing a flip-back-and-forth stress test, in which a secondary would be promoted, its former primary rewound, the two reattached, and then back again. This worked well for many iterations, but after one particularly fast iteration, the new secondary (former primary, now rewound) wouldn’t come back up; the message was:

requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103

What happened?

When PostgreSQL exits recovery, it starts a forced checkpoint; you can see this in the logs with a message:

checkpoint starting: force
database system is ready to accept connections

Note that the relevant text here is that the checkpoint is starting; it hasn’t completed yet.
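
One way to tell whether that checkpoint has actually finished is to look for a completion line after the start line in the server log. The following is a minimal sketch, assuming log_checkpoints is enabled (so that a "checkpoint complete:" line is written) and that the log is in English; the function name is made up for illustration.

```python
from typing import Iterable

START_MARKER = "checkpoint starting:"
COMPLETE_MARKER = "checkpoint complete:"


def forced_checkpoint_finished(log_lines: Iterable[str]) -> bool:
    """Return True if the latest forced checkpoint start has a completion line after it.

    Assumption: log_checkpoints is on, so completions are logged at all.
    """
    seen_forced = False
    pending = False
    for line in log_lines:
        if START_MARKER in line:
            # The flags (e.g. "force") follow the start marker on the same line.
            flags = line.split(START_MARKER, 1)[1]
            if "force" in flags:
                seen_forced = True
                pending = True
        elif COMPLETE_MARKER in line and pending:
            # Checkpoints do not overlap, so this completion closes the pending start.
            pending = False
    return seen_forced and not pending
```

Fed the two log lines shown above, this returns False: the server is accepting connections, but the checkpoint is still in flight.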

The hypothesis is (not 100% confirmed, but seems likely) that the pg_rewind on the new secondary was done before the checkpoint had finished. Since one of the last things a checkpoint does is write the pg_control file, and one of the first things that pg_rewind does is read the control file from the source machine, there’s a window where the pg_control file on disk will be out of date. The result is a rather confused situation on the new secondary.
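
If that hypothesis holds, the stale state should be visible in the output of pg_controldata on the source: the "Latest checkpoint's TimeLineID" line would still show the old timeline until the post-promotion checkpoint completes. Here is a rough sketch of such a check. It only parses text that the caller has already captured, it assumes English (LC_ALL=C) output, and the expected timeline is something the caller has to supply; the function names are hypothetical.

```python
import re
from typing import Optional

# Matches the pg_controldata line that reports the timeline of the latest checkpoint.
TIMELINE_RE = re.compile(r"^Latest checkpoint's TimeLineID:\s+(\d+)\s*$", re.MULTILINE)


def latest_checkpoint_timeline(controldata_output: str) -> Optional[int]:
    """Extract the latest checkpoint's timeline, or None if the line is missing."""
    match = TIMELINE_RE.search(controldata_output)
    return int(match.group(1)) if match else None


def control_file_reflects(controldata_output: str, expected_timeline: int) -> bool:
    """True once the control file has a checkpoint on (at least) the expected timeline."""
    timeline = latest_checkpoint_timeline(controldata_output)
    return timeline is not None and timeline >= expected_timeline
```

This is a diagnostic aid rather than a fix; it does not confirm the hypothesis, it only shows what the control file says at the moment it is read.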

Thus, if you are going to do a pg_rewind, make sure the initial checkpoint on the source machine has completed before doing the rewind. In a scripted environment, this can be tricky, since the initial checkpoint can take any amount of time depending on how much has to be flushed out to disk. One option is to issue a CHECKPOINT statement to the newly-promoted primary to make sure a checkpoint has been completed; this does a fast checkpoint, and doesn’t return until the checkpoint is completed. (Edited to recommend CHECKPOINT.)
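
In a script, the ordering described above might look like the sketch below: run CHECKPOINT on the newly promoted primary through psql, and only start pg_rewind once that has returned. The connection string, data directory, and function names are placeholders, and the sketch assumes the connecting role is allowed to run CHECKPOINT and that the target server has already been shut down cleanly.

```python
import subprocess
from typing import Callable, List, Sequence


def checkpoint_command(source_conninfo: str) -> List[str]:
    """psql invocation that issues CHECKPOINT on the new primary and waits for it."""
    return ["psql", "-X", "-d", source_conninfo, "-c", "CHECKPOINT;"]


def rewind_command(target_pgdata: str, source_conninfo: str) -> List[str]:
    """pg_rewind invocation that rewinds the old primary against the new one."""
    return [
        "pg_rewind",
        f"--target-pgdata={target_pgdata}",
        f"--source-server={source_conninfo}",
    ]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def checkpoint_then_rewind(
    target_pgdata: str,
    source_conninfo: str,
    run: Callable[[Sequence[str]], None] = run_checked,
) -> None:
    # Step 1: force a checkpoint on the source; psql returns only after it completes.
    run(checkpoint_command(source_conninfo))
    # Step 2: the source's pg_control is now current, so the rewind can proceed.
    run(rewind_command(target_pgdata, source_conninfo))
```

Passing the runner in as a parameter keeps the ordering easy to check without touching a real cluster; in production the default runner simply shells out and stops at the first failure.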

(Thanks to Michael Paquier and Andres Freund for pointing me in the right direction here.)

Michael Banck at 00:36, 19 July 2018:

Why not just issue a ‘CHECKPOINT;’ command instead of pg_{start,stop}_backup()?

Vladimir at 09:24, 19 July 2018:

Hm, why not just call CHECKPOINT on newly promoted host? It will return when checkpoint completes.

Xof at 14:08, 22 July 2018:

CHECKPOINT is a better choice there, thanks! (I had for some reason falsely remembered that CHECKPOINT just signals the checkpointer, rather than waiting for it to return.)