For a long time, logical replication and physical standbys did not get along. You could have one or the other survive a failover, but not both. PostgreSQL 17 finally shipped the machinery to fix this. PostgreSQL 18 did not really add to it. PostgreSQL 19, currently in feature freeze, adds two genuine improvements and one quality-of-life change. The honest question for anyone who runs a primary with a hot standby and one or more logical subscribers is: should I turn this on in production yet?
The answer is yes, with conditions, and the conditions are sharper than the documentation makes them sound.
The problem, briefly
A logical replication slot on the primary tracks the position up to which a logical subscriber has consumed changes. That position is stored in the slot. The slot is not part of WAL. It is a piece of in-memory state that gets persisted to a file on disk on the primary, and that file is not shipped to the physical standby by the streaming replication protocol. So the standby has the WAL, but it does not have the slot.
If the primary fails and the standby is promoted, every logical subscriber pointed at the old primary is now pointed at a server that has the WAL but no slot. The subscriber does not know what to do; the new primary does not know where the subscriber was. You can recreate the slot, but you have lost the ordering guarantee. There is no good option. This was the situation for years.
The community kept circling around this with extensions — pglogical, the pg_failover_slots extension from EnterpriseDB, BDR’s variants — but the core was missing the plumbing. Someone had to ship it.
What PostgreSQL 17 actually shipped
Three things, and they need to be understood as a system.
First, on the primary, a new subscription option (and slot option) called failover. When you create a subscription with failover = true, the corresponding logical slot on the publisher is marked as a failover slot. That mark is what tells the rest of the machinery to keep this slot in sync with the standby.
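On the subscriber side, this looks roughly like the following (the subscription, publication, and connection names are illustrative):

```sql
-- Create a subscription whose publisher-side slot is marked as a failover slot.
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example dbname=app user=repl'
    PUBLICATION app_pub
    WITH (failover = true);

-- An existing subscription can be converted after the fact,
-- but it must be disabled while the option is changed:
ALTER SUBSCRIPTION app_sub DISABLE;
ALTER SUBSCRIPTION app_sub SET (failover = true);
ALTER SUBSCRIPTION app_sub ENABLE;
```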
Second, on the standby, a new GUC called sync_replication_slots. When set to on, a slotsync worker periodically copies the state of failover slots from the primary to the standby. It is a poll-and-copy, not WAL-based replication. The slot state is fetched out-of-band over a normal libpq connection.
Third, on the primary, a new GUC called synchronized_standby_slots. This lists the physical replication slots of the standbys that must be kept caught up before logical slots are allowed to advance. This is the actual safety mechanism. It does not synchronize anything. It is a holdback. A logical decoder on the primary will not send a change to a logical subscriber until that change has been received (and flushed) by every physical standby listed here. (Early in PostgreSQL 17’s development this GUC was briefly named standby_slot_names. The docs have been correct since; old blog posts and Stack Overflow answers have not.)
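In configuration terms, the three pieces wire together roughly like this (slot names and hostnames are placeholders):

```
# postgresql.conf on the primary: logical decoding for failover slots
# waits until the standby behind this physical slot has flushed the WAL.
synchronized_standby_slots = 'standby1_slot'

# postgresql.conf on the standby: start the slotsync worker.
sync_replication_slots = on
hot_standby_feedback = on           # required for slot synchronization
primary_slot_name = 'standby1_slot'
# primary_conninfo must include a dbname for the slotsync worker's
# libpq connection, e.g. 'host=primary.example dbname=postgres ...'
```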
Together, these three pieces give you the property that matters: when the standby is promoted, every failover slot on the new primary is at a position that is no later than where the logical subscriber actually is. The subscriber can resume.
The architectural choice is doing a lot of work here
This is not the design that was originally proposed. The original “failover slots” patch wanted slots themselves to be WAL-logged, so that any standby replaying the WAL would naturally have the slot. That design was rejected on grounds of complexity and on the legitimate concern that it bound the slot abstraction tightly to physical replication semantics that do not, in general, apply to logical decoding.
What shipped is the inverse: the slot state is copied asynchronously, and the primary is responsible for not letting logical consumers run ahead of the slowest physical standby. This is the right call architecturally. It is also the reason for almost all of the operational sharp edges. You are running two replication channels at different cadences, and the only thing keeping them honest is a holdback parameter on the primary.
Where it goes wrong in production
Three things to watch.
The slot copy is asynchronous, full stop. The standby fetches slot state on an interval. If the primary dies between two fetches, the slot on the standby is behind the primary’s last known position. That is fine, because synchronized_standby_slots made sure the logical subscriber is also behind that position. But “fine” here means “no data loss”; it does not mean “instant failover.” Before promoting, you must verify that the failover slots on the standby have caught up to where the logical subscriber needs them to be. The way to do that is to call pg_sync_replication_slots() manually on the standby and confirm the slot LSNs against the subscribers’ positions. If you skip this step in your failover automation, you have built a system that is correct in the average case and quietly broken in the failure case.
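A promotion gate along these lines belongs in the runbook (names are hypothetical; the `failover` and `synced` columns are the PostgreSQL 17 additions to pg_replication_slots):

```sql
-- On the standby, immediately before promotion: force one last sync
-- while the primary is still reachable. If the slotsync worker is
-- running, a concurrent manual call may be unnecessary or rejected
-- depending on your minor version; verify before scripting this.
SELECT pg_sync_replication_slots();

-- Then confirm the failover slots are present and synced.
SELECT slot_name, synced, confirmed_flush_lsn
FROM pg_replication_slots
WHERE failover;
-- Compare each confirmed_flush_lsn against the corresponding
-- subscriber's position before allowing promotion to proceed.
```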
synchronized_standby_slots is not the same as synchronous_standby_names. They have similar names and they do related things, but they are not the same parameter and they do not stack the way you might assume. synchronous_standby_names controls whether commits wait for standbys. synchronized_standby_slots controls whether logical decoders wait for standbys. You can have one without the other. If you want logical slots to be safe across failover but you do not want synchronous commit, that is the configuration to deploy. If you set synchronous_standby_names and assume your logical slots are also covered, you are wrong.
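The distinction is easiest to see side by side (names are illustrative; note that the two parameters do not even take the same kind of value):

```
# Waits at COMMIT: clients block until the standby whose
# application_name matches has confirmed the commit.
synchronous_standby_names = 'standby1'

# Waits in logical decoding: walsenders serving failover slots block
# until the standby behind this *physical slot name* has flushed the WAL.
synchronized_standby_slots = 'standby1_slot'
```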
You pay for it in throughput. Every change that gets sent to a logical subscriber is held until it is flushed on the physical standby. If your physical replication is healthy, the cost is the network round-trip plus the standby’s flush latency, which is small. If your physical replication is degraded — slow disk on the standby, network blip, anything — your logical subscriber slows down with it. This is the correct behavior, but it is a coupling that did not exist before. Monitor lag on both channels.
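Both channels are visible from the primary, so a monitoring query along these lines (thresholds are up to you) covers the coupling:

```sql
-- Physical channel: how far each standby's WAL flush is behind.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
FROM pg_stat_replication;

-- Logical channel: how far each logical slot's consumer is behind.
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS slot_lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

When the physical number grows, expect the logical number to grow with it; that is the holdback doing its job.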
What PostgreSQL 18 added
For this specific feature: nothing of consequence. PostgreSQL 18’s headline replication work was elsewhere, mostly around the asynchronous I/O subsystem and incremental improvements to logical decoding internals. The failover-slot story in 18 is the failover-slot story in 17, with a year of bug-fix patches behind it. That year of patches is, frankly, the more important thing. The feature was new in 17; it is now seasoned.
What PostgreSQL 19 adds
Two real changes and one quality-of-life change, all visible in Bruce Momjian’s release notes draft and in the January 2026 commitfest items.
The first is dynamic wal_level. Today, you must restart the server to change wal_level between replica and logical. In PostgreSQL 19, the effective WAL level is determined by whether any logical slot currently exists. Create the first logical slot and the effective level rises to logical; drop the last one and it falls back to replica. A new read-only GUC, effective_wal_level, exposes the current state. This is a much bigger deal than it looks. It removes one of the stupidest restart requirements in the entire WAL system, and for shops that occasionally need logical replication for a migration but otherwise run at replica, it eliminates a planned outage.
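If the draft release notes hold, the observable behavior would be something like the following (PostgreSQL 19 is unreleased, so the exact semantics may still shift before GA):

```sql
-- Server configured with wal_level = replica, no logical slots yet.
SHOW effective_wal_level;    -- expected to report 'replica'

-- Creating the first logical slot raises the effective level.
SELECT pg_create_logical_replication_slot('migration_slot', 'pgoutput');
SHOW effective_wal_level;    -- expected to report 'logical'

-- Dropping the last logical slot lets it fall back.
SELECT pg_drop_replication_slot('migration_slot');
SHOW effective_wal_level;    -- expected to report 'replica'
```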
The second is slotsync_skip_reason, a new column on pg_replication_slots that records, for failover slots, why the most recent sync attempt did not advance the slot. If your standby’s slot is stuck behind, you no longer have to grep server logs to find out whether the standby was lagging, the slot was inactive, or the primary was holding back. This is the column you will end up checking in production at 2 AM, and you will be glad it exists.
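On a PostgreSQL 19 standby, the 2 AM check would then be a single query (column semantics per the draft notes; the `failover` and `synced` columns already exist in 17):

```sql
-- On the standby: why did the last sync attempt not advance each failover slot?
SELECT slot_name, synced, slotsync_skip_reason
FROM pg_replication_slots
WHERE failover;
```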
The quality-of-life change is in EXPLAIN’s new IO option, which is unrelated to slots but interacts with them: you can now see which I/O on the primary is being spent in WAL flush waits induced by synchronized_standby_slots. That makes the throughput cost of the holdback measurable, instead of an unspecified fraction of “things are slow.”
What to do
If you run logical replication and you also run a physical standby, turn this on. Set sync_replication_slots = on on every standby that might be promoted. Set synchronized_standby_slots on the primary to the physical replication slots of those standbys. Create your subscriptions with failover = true, or alter the underlying slots to set the failover flag if they already exist. Then change your failover runbook so that promotion does not happen until the standby has confirmed slot positions against the subscribers it is about to take over.
If your failover is fully automated through Patroni or a similar tool, check that the version you are running understands failover slots and gates promotion on slot sync. Older versions do not. Patroni added support, but you have to be on a recent enough release for it to matter, and you have to actually configure it.
Wait for PostgreSQL 19 if you can; the dynamic wal_level and slotsync_skip_reason are not strictly required, but they remove enough friction that the question “should I deploy this in production” turns from “yes, with caveats” into just “yes.” If you cannot wait, deploy on 18, monitor lag on both channels, and treat your runbook as the load-bearing piece of the system. Because it is.