The B cluster shifts gears: from one-off oddities to the background writer parameters, which span four GUCs. We do the first two as a pair because bgwriter_delay introduces the process at all, and bgwriter_flush_after slots cleanly into the writeback tour from backend_flush_after.

What the background writer does

The bgwriter is a PostgreSQL auxiliary process whose job is to write dirty pages out of shared buffers before a backend needs to evict them. The motivating problem: when a backend wants a buffer slot but every candidate is dirty, the backend itself has to write a dirty page first, which means a user query is now blocked on disk I/O. The bgwriter preempts that by cleaning pages near the eviction frontier in the background, so backends find clean slots waiting for them.

The relevant statistics live in pg_stat_bgwriter. The diagnostic comparison is between buffers_clean (pages the bgwriter wrote), buffers_checkpoint (pages the checkpointer wrote at checkpoint time), and buffers_backend (pages a user backend had to write itself because no clean buffer was available). When buffers_backend is high relative to buffers_clean, the bgwriter isn’t keeping up and your user queries are eating the latency. That is the metric that should drive any tuning of the parameters below.
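A quick way to pull that comparison in one query — a sketch against the catalog as it stood through PostgreSQL 16; in 17 the backend-write and checkpointer counters moved out of pg_stat_bgwriter into pg_stat_io and pg_stat_checkpointer, so adjust accordingly there:

```sql
-- What fraction of buffer writes did user backends have to do themselves?
-- High backend_write_fraction = bgwriter (and checkpointer) not keeping up.
SELECT buffers_clean,
       buffers_checkpoint,
       buffers_backend,
       round(buffers_backend::numeric /
             nullif(buffers_clean + buffers_checkpoint + buffers_backend, 0),
             2) AS backend_write_fraction
FROM pg_stat_bgwriter;
```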

bgwriter_delay

Sets how often the bgwriter wakes up to do a round of work. Default 200ms, range 10ms to 10000ms. It can be set only in postgresql.conf or on the server command line; changes take effect on reload (SIGHUP), with no per-session override. On systems where sleep resolution is 10ms, a value that isn't a multiple of 10 may behave like the next higher multiple of 10; this is a kernel detail and doesn't usually matter.


The default means the bgwriter runs five times per second. That sounds modest, and on most workloads it is — the bgwriter’s per-cycle limit (governed by bgwriter_lru_maxpages, next post) does more to set the total work rate than the delay does. The combination of defaults produces a ceiling of about 4 MB/sec of dirty-page writeback from the bgwriter, which is intentionally conservative because the docs warn — accurately — that the bgwriter can write the same page multiple times per checkpoint interval if the page keeps getting re-dirtied. Aggressive bgwriter tuning on a workload with hot counter-style tables means writing the same page over and over before the checkpointer would have written it once. Pure I/O waste.
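The ceiling arithmetic, worked out from the defaults (bgwriter_lru_maxpages = 100, bgwriter_delay = 200ms, 8kB block size):

```sql
-- 100 pages per round × 5 rounds/sec (1000ms / 200ms) × 8 kB per page
SELECT 100 * (1000 / 200) * 8 AS kb_per_sec;  -- 4000 kB/s, i.e. ~4 MB/sec
```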

Lowering bgwriter_delay to 100ms or 50ms makes the writer run more often, smoothing its I/O into smaller, more frequent bursts. Useful on systems where you see brief stalls correlated with bgwriter activity. On well-behaved workloads with reasonable shared_buffers sizing, leaving it at 200ms is correct.
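Since the context is sighup, a change applies on reload with no restart. A sketch using ALTER SYSTEM (which writes postgresql.auto.conf) rather than hand-editing postgresql.conf; the 100ms value is illustrative, not a recommendation:

```sql
ALTER SYSTEM SET bgwriter_delay = '100ms';  -- persisted to postgresql.auto.conf
SELECT pg_reload_conf();                    -- sends SIGHUP; no restart needed
SHOW bgwriter_delay;                        -- confirm the live value
```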

bgwriter_flush_after

This is one of the four *_flush_after parameters — same mechanism, same sync_file_range()-based writeback hint, same kernel-page-cache motivation explained in detail in backend_flush_after. The summary: when the bgwriter has written more than this many bytes since the last hint, it asks the kernel to start writing those specific dirty buffers back to disk rather than letting them accumulate in the page cache.

Default is 512kB on Linux, 0 elsewhere. Range 0 to 2MB. Context same as bgwriter_delay.

The 512kB Linux default is sensible and almost always the right answer. Raise it (toward 1MB) on systems with very fast NVMe storage where larger writeback batches don't cause stalls; lower it to 256kB or 128kB on systems where bgwriter activity correlates with brief query-time hiccups. Setting it to 0 disables writeback hints entirely and leaves the kernel to manage dirty-page flush on its own schedule. On a typical Linux box with default vm.dirty_* settings, that means accumulating gigabytes of dirty data and dumping it all at fsync time: the exact failure mode this parameter was added to mitigate.
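As a postgresql.conf fragment, the three regimes above look like this (values illustrative, one uncommented at a time):

```
# bgwriter_flush_after — writeback hint batch size for the bgwriter
bgwriter_flush_after = 256kB    # smaller batches: stall-prone storage
#bgwriter_flush_after = 1MB     # larger batches: fast NVMe, no stall symptoms
#bgwriter_flush_after = 0       # disable hints; kernel manages writeback alone
```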

Tuning, jointly

The honest answer for most readers: leave both at the default and tune bgwriter_lru_maxpages and bgwriter_lru_multiplier first (next post). Those have far higher leverage on bgwriter throughput. bgwriter_delay controls cadence, not capacity; halving it doesn’t double the work the bgwriter can do, it just spreads the same throughput ceiling over twice as many cycles. bgwriter_flush_after is a writeback hint whose default is correct on the OS most people are running.

Recommendation: Leave bgwriter_delay at 200ms and bgwriter_flush_after at 512kB (Linux) unless pg_stat_bgwriter shows specific symptoms. The symptoms that justify tuning here are: high buffers_backend relative to buffers_clean (lower the delay, but also see the next post), or visible checkpoint-time latency spikes correlated with kernel writeback (raise or lower bgwriter_flush_after based on storage characteristics). On modern NVMe-backed servers neither parameter is typically the bottleneck.