All Your GUCs in a Row: data_sync_retry

data_sync_retry is a boolean, it defaults to off, and its context is postmaster so changing it needs a restart. You will almost certainly never change it. It exists as the visible scar tissue from the single most unsettling thing the PostgreSQL community ever learned about its own durability assumptions, and to explain a one-line setting we have to explain fsyncgate.

The contract PostgreSQL thought it had

PostgreSQL, like most databases, writes data to disk in two steps. First it write()s dirty pages, which copies them into the kernel’s page cache — fast, but not durable; the data is in memory the kernel intends to flush eventually. Then, at checkpoint time, it calls fsync() on the file, which is supposed to force every outstanding dirty page for that file down to durable storage and not return until they’re there.
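
The two steps can be sketched in a few lines of Python. This is only an illustration of the write-then-fsync pattern, not PostgreSQL's own code (which is C); the function name write_then_fsync and the file name relation_segment are made up for the example.

```python
import os
import tempfile


def write_then_fsync(path: str, payload: bytes) -> None:
    """Write payload to path, then ask the kernel to make it durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Step 1: write() copies the bytes into the kernel page cache.
        # When it returns, the data is in memory, not necessarily on disk.
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Step 2: fsync() is supposed to block until every dirty page
        # of this file has reached stable storage, or raise OSError.
        os.fsync(fd)
    finally:
        os.close(fd)


# Hypothetical usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "relation_segment")
    write_then_fsync(demo_path, b"page contents")
    with open(demo_path, "rb") as fh:
        round_trip = fh.read()
```

Everything interesting in the rest of this story happens between those two steps, while the data exists only in the page cache.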

The contract, as everyone understood it for decades, was simple. If fsync() returns success, your data is on disk and the WAL that describes those changes can be recycled. If fsync() returns failure, the write didn’t happen, the dirty pages are still sitting in the cache, and you can call fsync() again later. PostgreSQL’s checkpoint logic was built squarely on that second clause: the assumption that a failed fsync() is a retryable condition, because the data it failed to flush is still there to be flushed.

What the kernel actually did

When the Linux kernel tried to write a dirty page back to the device and the write failed — bad sector, thin-provisioned volume out of space, a SAN that briefly vanished — it had to decide what to do with the page it couldn’t write. Keeping it dirty forever means a page of memory that can never be reclaimed, and on a busy system with a persistently failing device that’s an unbounded leak the kernel cannot permit. So Linux did the pragmatic thing: it marked the page clean, recorded that an error had occurred, and reported that error on the next fsync() call for the file — exactly once. After that, the error was forgotten and the page was eligible for reuse like any other clean page.

Worse, in the kernels of the time the error was delivered to whichever file descriptor happened to call fsync() first, and only that one. PostgreSQL’s architecture has the checkpointer doing the flushing, and the checkpointer routinely opens files fresh rather than holding every descriptor open forever. So the error could be consumed and discarded by some other backend’s fsync(), or lost when a descriptor was closed, and the checkpointer’s subsequent fsync() would return a clean, cheerful success.

Now assemble the failure. A checkpoint runs. A writeback error occurs and is reported once, to the wrong file descriptor, then forgotten. The dirty page is marked clean and reused. PostgreSQL’s checkpointer fsync()s, gets success, concludes the checkpoint is complete, and recycles the WAL segments that described those writes. The data was never written to disk, the page cache copy is gone, and now the WAL copy is gone too. As Andrew Gierth put it in the original discussion, the data at that point is “gone from the known universe” — there is no copy of it anywhere, and no retry can recover what no longer exists.
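
The sequence is easier to follow as a toy simulation. The sketch below is a deliberately simplified model of the behavior described above, not kernel or PostgreSQL code; ToyPageCache and every name in it are invented for illustration, and the WAL is reduced to a plain dictionary.

```python
class ToyPageCache:
    """Toy model of the old behaviour: a failed page is marked clean,
    and the error is reported to exactly one fsync() caller."""

    def __init__(self, device_ok: bool):
        self.device_ok = device_ok
        self.dirty = {}             # page number -> bytes awaiting writeback
        self.disk = {}              # page number -> bytes on durable storage
        self.pending_error = False  # remembered until one fsync() reports it

    def write(self, page_no: int, data: bytes) -> None:
        self.dirty[page_no] = data

    def writeback(self) -> None:
        for page_no, data in list(self.dirty.items()):
            if self.device_ok:
                self.disk[page_no] = data
            else:
                self.pending_error = True
            # The page leaves the dirty set whether or not the write worked.
            del self.dirty[page_no]

    def fsync(self) -> bool:
        """Return True on success, False if an error is being reported."""
        self.writeback()
        if self.pending_error:
            self.pending_error = False  # reported once, then forgotten
            return False
        return True


cache = ToyPageCache(device_ok=False)
wal = {7: b"new row"}             # the WAL still describes the change
cache.write(7, b"new row")
cache.writeback()                 # background writeback fails, page dropped

other_backend_ok = cache.fsync()  # some other caller consumes the error
checkpointer_ok = cache.fsync()   # the checkpointer sees a clean success
if checkpointer_ok:
    wal.clear()                   # checkpoint "complete": WAL recycled

data_survives_anywhere = 7 in cache.disk or 7 in cache.dirty or 7 in wal
```

At the end of the run, data_survives_anywhere is False even though the checkpointer's own fsync() reported success.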

fsyncgate

Craig Ringer brought this to pgsql-hackers in early 2018, and the realization landed hard: PostgreSQL’s handling of fsync() errors was unsafe, had been unsafe for the project’s entire history, and, because the write-then-fsync pattern is universal, the same trap sat under MySQL, MongoDB, and essentially every other database that used buffered I/O. It got a name, fsyncgate, an LWN writeup, a PGCon 2018 talk by Linux kernel developer Matthew Wilcox on the kernel side of it, and a series of pointed conversations between database and kernel developers, some in person at that year’s Linux storage and filesystem summit. (Matthew’s talk drew just about everyone attending the conference, and his dry British “we apologize for any inconvenience” was met with a roar of friendly laughter.)

The kernel developers’ position was, from their point of view, defensible: the page cache is the kernel’s to manage, and an application that wants guarantees about specific bytes on specific devices is asking for something buffered I/O was never designed to promise. The database developers’ position was that the documented fsync() contract had quietly stopped meaning what everyone built on.

The fix, from both sides

The kernel side came first and was narrower than it sounds. Starting in Linux 4.13, Jeff Layton’s errseq_t work attached a 32-bit error-and-sequence value to each inode’s address_space, so that a writeback error is reported to every descriptor that has the file open when it next calls fsync(), not just to the first caller. That closed the “wrong file descriptor ate the error” hole, but only the reporting hole. It does not bring the data back; a reliably reported error about lost data is still lost data. And it has its own gap: if the inode is evicted under memory pressure before anyone reopens the file, the recorded error goes with it.
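
The reporting idea can be modeled with a counter. The sketch below is a toy, assuming a plain integer in place of the kernel's packed errseq_t value; ToyMapping and ToyDescriptor are hypothetical names, and the model ignores inode eviction and every other detail of the real implementation.

```python
class ToyMapping:
    """Per-file error state: a counter bumped on every writeback error."""

    def __init__(self):
        self.err_seq = 0

    def record_writeback_error(self) -> None:
        self.err_seq += 1


class ToyDescriptor:
    """An open file that remembers the last counter value it observed."""

    def __init__(self, mapping: ToyMapping):
        self.mapping = mapping
        self.seen = mapping.err_seq  # sampled when the file is opened

    def fsync(self) -> bool:
        """Return False if an error happened since this descriptor last looked."""
        if self.mapping.err_seq != self.seen:
            self.seen = self.mapping.err_seq
            return False
        return True


mapping = ToyMapping()
fd_a = ToyDescriptor(mapping)  # say, an ordinary backend
fd_b = ToyDescriptor(mapping)  # say, the checkpointer

mapping.record_writeback_error()

first_a = fd_a.fsync()   # False: this descriptor is told about the error
first_b = fd_b.fsync()   # False: so is this one, nobody "ate" it
second_a = fd_a.fsync()  # True: each descriptor is told only once
```

Note what the model does not contain: any copy of the lost page. Both callers learn that something went wrong, and neither can get the data back from the kernel.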

PostgreSQL’s side is data_sync_retry, and it is brutally simple. Thomas Munro committed the change in November 2018 — authored by Craig Ringer — and it was back-patched to every supported release. The new behavior: if fsync() on a data file fails, do not retry, do not try to reason about it. PANIC. Crash the server, and let crash recovery replay from the last checkpoint — because the WAL still describes the lost writes, the checkpoint that would have recycled it never completed, and replaying the WAL reconstructs exactly the data that the failed fsync() lost. Retrying is the one thing you must never do, because a retry can return success against a page that no longer exists. Crashing and recovering from the WAL is, counterintuitively, the safe response. PostgreSQL had always done this for WAL files; the 2018 change extended it to ordinary data files.

data_sync_retry = off, the default, is that PANIC behavior. Setting it to on restores the old retry-and-continue logic, and the only systems where that is safe are ones whose kernels are known to keep dirty pages after a writeback failure rather than dropping them — which Linux, emphatically, does not. The docs are unusually direct that on such an OS a retried fsync() “may be reported as successful, when in fact the data has been lost.” If you are on an operating system with different, dirty-page-retaining semantics, and you genuinely know that to be true of your kernel, on is available to you. For everyone on Linux, on is a setting whose only honest description is “please lose my data quietly instead of loudly.”
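
The two settings amount to two different answers to one exception. The sketch below illustrates that decision in Python under stated assumptions: Panic, sync_data_file, and failing_fsync are hypothetical names, and PostgreSQL's actual implementation is C code that differs in detail.

```python
import errno
import os


class Panic(Exception):
    """Stand-in for a PANIC: stop everything and rely on WAL replay."""


def sync_data_file(fd, data_sync_retry=False, fsync=os.fsync):
    """Return True if the sync succeeded, False if it failed and the
    caller is allowed to retry later. Raise Panic otherwise."""
    try:
        fsync(fd)
        return True
    except OSError as exc:
        if data_sync_retry:
            # Old behaviour: report the failure and let the caller retry.
            # Only safe if the kernel keeps failed pages dirty.
            return False
        # Default behaviour: never retry, because a later fsync() may
        # report success for pages the kernel has already dropped.
        raise Panic(f"could not fsync file: {exc.strerror}") from exc


def failing_fsync(fd):
    """Fake fsync() that always fails, used only to exercise both paths."""
    raise OSError(errno.EIO, os.strerror(errno.EIO))


retry_result = sync_data_file(0, data_sync_retry=True, fsync=failing_fsync)
try:
    sync_data_file(0, data_sync_retry=False, fsync=failing_fsync)
    panicked = False
except Panic:
    panicked = True
```

The fsync parameter exists only so the example can inject a failure without touching a real device.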

There is still a narrow residual window even with the default: the file-closed-then-reopened-then-inode-evicted race that the kernel’s errseq_t work also can’t fully close. PostgreSQL 12 addressed the PostgreSQL half of it with machinery to keep files with dirty data open continuously and hand their descriptors to the checkpointer, which was judged too invasive to back-patch. The genuinely complete fix, the one that Dave Chinner, Ted Ts’o, and the PostgreSQL developers all agree on, is Direct I/O for the main data path, bypassing the page cache entirely so the database owns its own writeback and error handling. Andres Freund described that as “a metric ton of work,” and as of 2026 it still has not fully landed.

So leave it off. There is no tuning here, only a default that encodes a hard-won lesson. Do one thing with this knowledge instead: alert on PANIC in your logs, because a checkpoint-time fsync PANIC is the database telling you, in the only way it can, that the storage layer just admitted to lying — and then go find the faulty hardware or the shared filesystem that doesn’t actually honor fsync(), because the PANIC was the symptom, not the disease.
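
A minimal sketch of that alert, assuming plain-text server logs in which the severity appears as "PANIC:" followed by the message. The sample lines are invented for illustration and only modeled on a typical timestamp-and-PID prefix; find_panics is a hypothetical helper, and a real deployment would feed this from whatever log pipeline it already has.

```python
import re

# Severity tag followed by the message text.
PANIC_RE = re.compile(r"\bPANIC:\s+(?P<message>.*)")


def find_panics(lines):
    """Return the message part of every log line at PANIC severity."""
    hits = []
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            hits.append(match.group("message").strip())
    return hits


# Hypothetical log excerpt, not captured from a real server.
sample_log = [
    '2026-01-05 03:14:07.123 UTC [4242] LOG:  checkpoint starting: time',
    '2026-01-05 03:14:09.456 UTC [4242] PANIC:  could not fsync file '
    '"base/16384/16385": Input/output error',
    '2026-01-05 03:14:09.460 UTC [4100] LOG:  checkpointer process '
    '(PID 4242) was terminated by signal 6: Aborted',
]

panics = find_panics(sample_log)
```

The pattern is intentionally loose and would also match "PANIC:" inside logged statement text, so treat a hit as a reason to look, not as a diagnosis.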
