wal_sender_shutdown_timeout: Now Actually a Timeout

If you have ever run pg_ctl stop -m fast on a primary and watched it hang well past wal_sender_shutdown_timeout, you have met a bug that has been sitting in walsender.c for years. As of commit c0b24b3 on master (Fujii Masao, May 1, reported by Andres Freund via FreeBSD CI), it is fixed. PostgreSQL 19 will enforce the timeout. PostgreSQL 18 and earlier will not — there is no back-patch.

(I take this bug a bit personally, as I’ve run into it multiple times in the PostgreSQL community infrastructure.)

The bug

When walsender is told to shut down — fast or smart, doesn’t matter — and it has finished streaming WAL, it sends a CommandComplete to the receiver to mean we are done here. Until this commit, the relevant code in WalSndDone() was, in its entirety:

    SetQueryCompletion(&qc, CMDTAG_COPY, 0);
    EndCommand(&qc, DestRemote, false);
    pq_flush();

Both EndCommand() (which internally calls pq_putmessage()) and pq_flush() are blocking calls. They wait, however long it takes, for the kernel send buffer to drain so the data can be queued or pushed out. If the standby has crashed, has been network-partitioned, or has simply stopped reading the socket — which, you will note, are exactly the situations in which you are most likely to be running pg_ctl stop in the first place — those calls happily wait forever.

The result is that wal_sender_shutdown_timeout (default 5 minutes) is silently bypassed. Postmaster sits there waiting for walsender. Walsender sits there waiting for a socket that is not coming back. The cluster does not shut down. Your runbook calls for immediate-mode shutdown, you do that, you eat crash recovery on the next startup. This is not a hypothetical; this is what most replicated shops have been quietly working around.

The lower bound on how long this hang lasts is “until TCP retransmission gives up,” which on Linux is governed by tcp_retries2 (default 15) and works out to roughly 15 minutes. The upper bound is “until something else kills the process.” Neither is what wal_sender_shutdown_timeout claims to provide.
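As a rough check on the 15-minute figure, the sketch below sums an exponentially backed-off retransmission timer. It assumes the timer starts at Linux's 200 ms floor, doubles after every attempt, and is capped at 120 s. Real connections start from a measured round-trip time, so treat the result as a hypothetical minimum rather than a prediction. The function name is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Linux constants: TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s. */
#define RTO_MIN_MS 200
#define RTO_MAX_MS 120000

/*
 * Hypothetical helper: total time before the kernel gives up on a
 * connection after `retries` retransmissions (the tcp_retries2 setting).
 * There are retries + 1 waits: one after the original transmission and
 * one after each retransmission.
 */
int64_t
retransmit_giveup_ms(int retries)
{
    int64_t total = 0;
    int64_t rto = RTO_MIN_MS;

    assert(retries >= 0);

    for (int i = 0; i <= retries; i++)
    {
        total += rto;
        rto *= 2;               /* exponential backoff */
        if (rto > RTO_MAX_MS)
            rto = RTO_MAX_MS;   /* capped at the maximum RTO */
    }
    return total;
}
```

With retries set to 15 this returns 924600 ms, or about 15.4 minutes, which is where the "about 15 minutes" comes from.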

Why `pq_flush()` was the wrong tool

Walsender already has a perfectly serviceable nonblocking flush idiom. It is used everywhere else in the file: poll pq_is_send_pending(), sleep on WalSndWait(WL_SOCKET_WRITEABLE, …), call pq_flush_if_writable(), check the shutdown timeout each loop. The shutdown path was just out of sync with the rest of the module — someone wrote the obvious thing (pq_flush) instead of the right thing.

The reason the obvious thing is wrong is that pq_flush() does not know about wal_sender_shutdown_timeout, wal_sender_timeout, or any other deadline. It is the unconditional synchronous flush. In a process whose entire job during shutdown is to enforce a deadline against a possibly-dead peer, calling an unconditional synchronous flush is exactly the call you must not make.
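To make the contrast concrete, here is a minimal sketch of a deadline-aware send loop in plain POSIX C. It is not PostgreSQL code and does not use the pq_* API; send_with_deadline() and demo_silent_peer() are hypothetical names. The demo uses a socketpair whose other end never reads, which stands in for a standby that has gone silent, and assumes an 8 MB payload is larger than the socket send buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in milliseconds. */
static long long
now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: push len bytes to a nonblocking socket, giving up
 * once timeout_ms has elapsed. Returns 0 when everything was sent, 1 when
 * the deadline expired, -1 on a socket error.
 */
int
send_with_deadline(int fd, const char *buf, size_t len, int timeout_ms)
{
    long long deadline = now_ms() + timeout_ms;
    size_t sent = 0;

    while (sent < len)
    {
        long long remaining = deadline - now_ms();
        struct pollfd pfd = {.fd = fd, .events = POLLOUT};
        ssize_t n;

        if (remaining <= 0)
            return 1;           /* the deadline is checked on every pass */

        /* Sleep until writable or until the deadline, whichever is first. */
        if (poll(&pfd, 1, (int) remaining) < 0 && errno != EINTR)
            return -1;

        n = send(fd, buf + sent, len - sent, 0);
        if (n > 0)
            sent += (size_t) n;
        else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK &&
                 errno != EINTR)
            return -1;          /* connection is broken */
    }
    return 0;
}

/*
 * Demo: the peer end of the socketpair never reads, so the send buffer
 * fills and the loop can only end through the deadline.
 */
int
demo_silent_peer(int timeout_ms)
{
    static char payload[8 * 1024 * 1024];   /* assumed > socket buffer */
    int sv[2];
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);
    rc = send_with_deadline(sv[0], payload, sizeof(payload), timeout_ms);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Calling demo_silent_peer(100) should come back after roughly 100 ms with the deadline result, where a blocking send of the same payload would wait indefinitely for a reader that never shows up.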

The fix

Two pieces.

First, in src/backend/tcop/dest.c, the existing EndCommand() is renamed EndCommandExtended() and gains a bool noblock argument. When noblock is true, it uses pq_putmessage_noblock() instead of pq_putmessage(). The old EndCommand() becomes a thin wrapper that calls EndCommandExtended(..., false), so every existing caller is unaffected:

    void
    EndCommandExtended(const QueryCompletion *qc, CommandDest dest,
                       bool force_undecorated_output, bool noblock)
    {
        ...
        if (noblock)
            pq_putmessage_noblock(PqMsg_CommandComplete, completionTag, len + 1);
        else
            pq_putmessage(PqMsg_CommandComplete, completionTag, len + 1);
        ...
    }

Second, in src/backend/replication/walsender.c, both WalSndDone() and WalSndDoneImmediate() now call EndCommandExtended(..., true) to queue the message without blocking, set a new file-static shutdown_stream_done_queued so the message is not enqueued twice, and replace the bare pq_flush() with the standard walsender flush loop:

    for (;;)
    {
        long sleeptime;

        WalSndCheckShutdownTimeout();

        if (!pq_is_send_pending())
            break;

        sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());

        WalSndWait(WL_SOCKET_WRITEABLE, sleeptime,
                   WAIT_EVENT_WAL_SENDER_WRITE_DATA);

        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();

        if (pq_flush_if_writable() != 0)
            WalSndShutdown();
    }

WalSndCheckShutdownTimeout() is called at the top of every iteration, before computing the sleep duration, so the deadline is consulted both as the wake condition and as the loop guard. If the timeout has already expired, the loop exits via WalSndCheckShutdownTimeout()’s proc_exit() and the walsender goes away. If pq_flush_if_writable() returns nonzero (i.e. the connection is broken), WalSndShutdown() cleans up immediately. If the buffer drains, the loop falls out and proc_exit(0) runs normally.

There is also a one-line subtlety worth noting: last_reply_timestamp is reset to zero before entering the flush loop. This disables wal_sender_timeout during shutdown, so that timer cannot fire mid-flush and produce a confusing log entry. Only wal_sender_shutdown_timeout matters once you are in shutdown mode.
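The zero-timestamp trick is easier to see in isolation. The sketch below is a toy model, not the walsender code: it assumes the timeout check treats a non-positive last-reply timestamp as "timer disabled", and both the type and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t timestamp_ms;   /* stand-in for PostgreSQL's TimestampTz */

/*
 * Hypothetical check: reports whether the reply timeout has expired.
 * A zero (or negative) last_reply means no reply deadline is armed,
 * so the check can never fire.
 */
bool
reply_timeout_expired(timestamp_ms last_reply, timestamp_ms now,
                      int64_t timeout_ms)
{
    if (timeout_ms <= 0 || last_reply <= 0)
        return false;           /* timer disabled */
    return now >= last_reply + timeout_ms;
}
```

Under that assumption, zeroing the timestamp before the flush loop leaves exactly one deadline in play, which matches the behavior described above.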

What this means in practice

If you are on PostgreSQL 19 (when it ships): wal_sender_shutdown_timeout is now enforced even when a standby has gone silent. pg_ctl stop -m fast will return within the timeout, and walsender will log its forced exit rather than hanging in pq_flush().

If you are on PostgreSQL 18 or earlier: nothing has changed. The commit message includes no Backpatch-through: line, and the patch touches a public function signature in dest.h, so back-patching it would be ABI-disruptive. Your operational workaround remains the same: if pg_ctl stop exceeds a sensible bound, escalate to -m immediate and accept the recovery.

The deeper lesson here is the one that always lurks behind these “obvious” bugs: synchronous calls inside processes that have explicit deadline-enforcement responsibilities are almost always wrong, and the asymmetry between the easy synchronous API and the slightly-more-typing nonblocking API is exactly why these bugs accumulate. Walsender gets it right almost everywhere; it just had this one stale corner. Now it doesn’t.
