A benchmark came out of AWS earlier this month showing PostgreSQL throughput on Linux 7.0 dropping to 0.51x what the same workload produced on Linux 6.x. The Phoronix headline wrote itself. Hacker News did what Hacker News does. By the end of the week, I had been asked by three separate clients whether they needed to hold their kernel upgrades.
They don’t. Almost nobody does. The regression is real, but it’s a narrow, loud artifact of a benchmark configuration that was already wrong for a 96-vCPU box with 100+ GB of shared memory. The headline undersells how much this is a “don’t do that” story and oversells how much this is a Linux-broke-Postgres story.
Let me walk through what actually happened, because the explanation is interesting on its own merits — it touches the scheduler, the TLB, page faults, and the one spinlock in Postgres that nobody outside the buffer manager thinks about. And it ends where a lot of Postgres performance stories end: with huge pages.
Lætitia Avrot wrote her own clear-eyed walkthrough of the same regression on My DBA Notebook on April 15, and if you read only one piece on this, read hers. What follows is my version of the story, with more time spent on the mechanism and a few more diagrams.
What Linux 7.0 actually changed
Before Linux 7.0, you could build or boot a kernel in one of three preemption modes:
- `PREEMPT_NONE` — the kernel almost never interrupts a running userspace thread. Thread gets its slice, uses its slice, yields on syscall or sleep. This is what you historically wanted on a server: batch-throughput-friendly, minimum context-switch overhead.
- `PREEMPT_FULL` — the kernel can interrupt userspace at almost any safe point. Low latency, lots of context switches, historically the desktop default.
- `PREEMPT_LAZY` — a newer middle ground. The scheduler can interrupt, but will wait for “natural” boundaries when it can.
Linux 7.0, via Peter Zijlstra’s preemption-cleanup series, removed PREEMPT_NONE entirely on arm64, x86, powerpc, riscv, s390, and loongarch. What you get now is PREEMPT_FULL or PREEMPT_LAZY. On most distros the default shifted to PREEMPT_LAZY.
For nearly every workload this is fine. PREEMPT_LAZY is supposed to approximate PREEMPT_NONE behavior under throughput-oriented loads. Most of the time it does. The exception is when a userspace thread enters a critical section where getting preempted is catastrophic — and then stays in that critical section just a little longer than the scheduler expects.
Spinlocks, in other words.
The benchmark
Salvatore Dipietro at AWS posted the regression to LKML on April 3 as a one-patch series titled “sched: Restore PREEMPT_NONE as default.” The setup:
- `m8g.24xlarge` — 96 vCPU Graviton4 — running Amazon Linux 2023.
- Kernel: `next-20260331`, a linux-next snapshot, with and without a revert of commit `7dadeaa6e851` (“sched: Further restrict the preemption modes”), which is the 7.0-rc1 change that removed `PREEMPT_NONE` on arm64.
- Storage: 12× 1 TB io2 volumes at 32,000 IOPS each, RAID0, XFS.
- PostgreSQL 17, pgbench `simple-update`, 1,024 clients, 96 threads, prepared protocol, scale factor 8,470, `fillfactor=90`, 1,200-second runs.
The results, averaged over three runs each:
| Configuration | Avg tps | Ratio |
|---|---|---|
| Baseline (linux-next, PREEMPT_LAZY) | 50,751.96 | 1.00x |
| With 7dadeaa6e851 reverted | 98,565.86 | 1.94x |
The baseline is the one that got the headline. Reverting the preemption-mode change nearly doubles throughput. Stated the other direction: on this workload, Linux 7.0 delivers 0.51x.
perf showed the regression sitting, almost undiluted, in a single call chain:
```
|- 56.03% - StartReadBuffer
   |- 55.93% - GetVictimBuffer
      |- 55.93% - StrategyGetBuffer
         |- 55.60% - s_lock          <<<< 55% of CPU
         |- 0.08% - LockBufHdr
   |- 0.07% - hash_search_with_hash_value
```
More than half of the machine’s CPU time is being burned on one userspace spinlock, which is a very specific and very telling place for the hot spot to land.
Why that spinlock, and why now
StrategyGetBuffer is the function Postgres calls when a backend needs a buffer and doesn’t already have one. It serializes on one spinlock — StrategyControl->buffer_strategy_lock — in two cases. On a cold buffer pool, it pops from the freelist. Once the freelist drains, it runs the clock sweep, advances nextVictimBuffer, and returns a candidate. Both paths take the same spinlock. On a 96-vCPU machine with 1,024 clients, any serialization point will get loud, but this one has a specific property: the section protected by the spinlock can, under the wrong configuration, take a minor page fault.
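As a mental model, the serialization structure can be sketched in a few lines of Python. This is a toy, not the real C implementation: usage counts, the buffer access strategy ring, and buffer pinning are all omitted, and the names only loosely follow the Postgres ones. The point it illustrates is that both paths funnel through the same lock.

```python
# Toy model of the two paths inside Postgres's StrategyGetBuffer:
# freelist pop while the pool is cold, clock sweep afterwards.
# Both run under one lock, the stand-in for buffer_strategy_lock.
import threading

class BufferStrategy:
    def __init__(self, nbuffers: int):
        self.lock = threading.Lock()           # buffer_strategy_lock stand-in
        self.freelist = list(range(nbuffers))  # cold pool: every buffer free
        self.next_victim = 0                   # the clock hand
        self.nbuffers = nbuffers

    def get_buffer(self) -> int:
        with self.lock:                        # the single serialization point
            if self.freelist:                  # path 1: pop from the freelist
                return self.freelist.pop()
            victim = self.next_victim          # path 2: advance the clock sweep
            self.next_victim = (self.next_victim + 1) % self.nbuffers
            return victim
```

Every backend that needs a victim buffer goes through `self.lock`. In real Postgres that lock is a spinlock, so a waiter burns CPU instead of sleeping, and anything that stalls the holder inside the critical section stalls everyone.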
That’s the piece that turns a bad-but-tolerable serialization point into a 55%-of-CPU catastrophe, and it’s where the kernel change matters. The contention is real at any parallelism this high. The question is why it got twice as bad under PREEMPT_LAZY. The answer — as Andres Freund worked out on -hackers and on Hacker News — is not the scheduler, or not directly.
The actual culprit: minor page faults inside a spinlock
This is the part that’s worth slowing down for.
With huge_pages=off, Postgres’s shared memory is mapped with ordinary 4 KB pages. A 120 GB shared_buffers is, in PTE terms, roughly 31 million pages. Every one of those pages, on first touch, causes a minor page fault — the VM subsystem has to wire up a physical page and install a PTE. That minor fault takes microseconds, which is forever in spinlock terms. And a 1,200-second benchmark at scale factor 8,470 will keep touching previously-unmapped pages throughout the run, not just during the first few seconds: pgbench’s uniform-random access pattern against a 127 GB pgbench_accounts table means new pages keep entering the working set for a long time.
Now consider the sequence on the hot path. A backend holds buffer_strategy_lock. To pop a buffer off the freelist — or to advance the clock sweep against a buffer whose header sits on a page the backend hasn’t touched — it has to read or write shared memory that hasn’t been faulted in. That access takes a minor page fault. The spinlock holder is now stalled in the kernel fault handler. Every other backend — dozens or hundreds of them on a 96-vCPU box under 1,024 clients — is spinning in userspace, burning CPU, waiting.
Now layer PREEMPT_LAZY on top. PREEMPT_NONE was never going to preempt Backend A while it was holding the spinlock; it just wasn’t going to preempt anybody in userspace that didn’t ask for it. PREEMPT_LAZY might. When it does, the spinlock hold time balloons from “microseconds of page-fault service” to “microseconds of page-fault service plus however long it takes the scheduler to hand control back to the holder.” The queue of spinners grows. The wasted CPU time compounds.
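The blowup is easy to put rough numbers on. Everything in this sketch is an illustrative assumption, not a measurement: a minor fault costs a few microseconds, a preempted holder is off-CPU for something on the order of a scheduling delay, and each waiter burns a full core spinning the whole time.

```python
# Rough convoy arithmetic: spin CPU wasted while the lock is held.
def wasted_spin_us(hold_us: float, waiters: int) -> float:
    # Every waiter burns one core for the full hold time.
    return hold_us * waiters

FAULT_US = 3.0       # assumed minor-fault service time
RESCHED_US = 1000.0  # assumed delay before a preempted holder runs again
WAITERS = 90         # assumed backends spinning on a 96-vCPU box

none_like = wasted_spin_us(FAULT_US, WAITERS)               # holder never preempted
lazy_like = wasted_spin_us(FAULT_US + RESCHED_US, WAITERS)  # holder preempted mid-fault

print(f"{lazy_like / none_like:.0f}x more spin per unlucky fault")
```

The absolute numbers are made up; the shape is not. Hold time multiplied by waiter count is why a microsecond-scale stall under high parallelism turns into whole-machine CPU burn.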
The preemption mode change isn’t creating the bug. It’s making the pre-existing bug — taking page faults inside a spinlock while mapped with 4 KB pages against 100+ GB of shared memory under absurd parallelism — visible by a factor of two.
Why huge pages make the problem disappear
Here is the TLB density story in numbers — the same 120 GB of shared_buffers, mapped three different ways: ~31 million page-table entries at 4 KB, ~61,000 at 2 MB, and 120 at 1 GB.
Two things happen when you switch to huge pages. First, the number of faults on first touch goes from ~31 million to ~61,000 or ~120 — because the kernel maps the whole huge page on the first access to any byte in it, and those mappings are usually done at mmap time anyway when using MAP_HUGETLB. The cold-start fault storm evaporates. Second, the number of mappings the TLB has to cover drops by a factor of 512 (2 MB pages) or 262,144 (1 GB pages), so the buffer-manager hot path actually stays hot in the TLB instead of thrashing it.
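The arithmetic behind those counts, for 120 GB of shared memory:

```python
# Mapping-density arithmetic for 120 GB of shared_buffers at the three
# page sizes Linux offers on most 64-bit platforms.
GIB = 2**30
SHARED = 120 * GIB

for name, page in [("4 KB", 4 * 2**10), ("2 MB", 2 * 2**20), ("1 GB", GIB)]:
    entries = SHARED // page
    print(f"{name:>5} pages: {entries:>10,} mappings to fault in / keep in the TLB")
# 4 KB: 31,457,280    2 MB: 61,440    1 GB: 120
```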
Freund’s own finding, when he tried to reproduce: with huge_pages=on (or even THP doing the right thing), the regression does not appear. With huge_pages=off, on the same hardware, the regression appears. The variable is not the kernel. The variable is the shared-memory mapping. Dipietro’s patch proposes to restore PREEMPT_NONE as the default. Freund’s response is essentially: the kernel is doing what you asked it to; your shared_buffers is mapped wrong.
The benchmark walked straight into the narrow scenario where the Linux 7.0 change matters: huge shared memory, small pages, a working set that keeps pulling new pages in, and enough parallelism to keep the spinlock queue saturated. Change any one of those variables and the regression doesn’t reproduce.
The kernel community’s counter-proposal
The suggested fix from the kernel side — have Postgres use rseq-based time-slice extensions to tell the scheduler “please don’t preempt me right now” — arrived with the confidence of someone who has never shipped a database. Freund’s response was measured and exactly right: requiring userspace software to adopt a new kernel facility introduced in 7.0, in order to paper over a regression that only exists in 7.0+, is not a good deal. It also sits uncomfortably next to the “we do not break userspace” principle, which is usually invoked more aggressively than this.
Will Postgres eventually do something like it? Probably. There are other reasons to want slice-extension semantics around LWLockAcquire and the buffer strategy lock, independent of this specific regression. But it’s the kind of thing that belongs in a thoughtful patch in PG20 or PG21, not an emergency backport. In the meantime, there is already a mitigation that works today and has been the recommended configuration since before any of this was a question.
What you should actually do
If you are self-hosted on a real server
Set huge_pages = on in postgresql.conf and actually provision the huge pages. It’s not hard:
```shell
# Ask Postgres how many 2 MB pages it would need:
postgres -D $PGDATA -C shared_memory_size_in_huge_pages

# Allocate them (persist this in /etc/sysctl.conf):
sysctl -w vm.nr_hugepages=<N>

# Verify:
grep -E 'HugePages_Total|HugePages_Free' /proc/meminfo
```
Then set huge_pages = on. Not try. on. If you can’t start Postgres because you didn’t allocate enough pages, that is a useful failure — it tells you you didn’t provision correctly — not a reason to fall back to 4 KB pages silently. I have seen too many “why is it slow?” tickets where the answer turned out to be “huge_pages = try fell back three months ago and nobody noticed.”
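If you want a sanity check on the sysctl value, the arithmetic is simple. The authoritative number comes from the `postgres -C` command above; this sketch only approximates it, and the 5% overhead slack for shared memory beyond shared_buffers is my assumption, not a Postgres constant:

```python
# Back-of-envelope estimate of vm.nr_hugepages, assuming 2 MiB huge pages.
# The 5% slack covers Postgres's shared memory beyond shared_buffers
# (WAL buffers, lock tables, etc.) and is an assumed figure, not exact.
def huge_pages_needed(shared_buffers_gib: int, overhead_pct: int = 5) -> int:
    total_mib = shared_buffers_gib * 1024 * (100 + overhead_pct) // 100
    return -(-total_mib // 2)  # ceiling division by the 2 MiB page size

print(huge_pages_needed(120))  # → 64512
```

If this estimate and the `postgres -C` output disagree wildly, one of your size settings is not what you think it is.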
Upgrade your kernel. You are not affected.
If you are self-hosted under containers
This is where the regression has teeth, and it has nothing specifically to do with Linux 7.0. Large Postgres under Kubernetes, with 100+ GB shared_buffers, without host-level huge page reservations, has been slow for years. The 7.0 change adds a second-order slowdown on top of an already bad situation.
Two paths. Either your container runtime and host are configured to expose huge pages to the Postgres pod (Kubernetes has a hugepages-2Mi resource; the host has to have pages reserved; the pod spec has to request them), or they aren’t. If they aren’t, shrink shared_buffers or move Postgres out of the container, because you are going to have an unpleasant time either way. The kernel upgrade just brings forward the day on which you realize it.
If you are on RDS, Aurora, Cloud SQL, Azure Flexible Server, Neon, or similar
You don’t manage the kernel. You don’t configure huge pages. You are relying on the vendor to get this right. AWS has shipped enough benchmark-driven content about huge pages on Graviton RDS instances that I would be surprised if it is not already enabled on Aurora Postgres and RDS Postgres instances sized where this would matter. Supabase, Neon, Crunchy Bridge, and the others have their own stories; ask your vendor.
None of them are going to let you set huge_pages yourself. That’s fine. It’s the vendor’s problem.
If your benchmark showed the regression
It showed it because you turned huge pages off, or didn’t turn them on, on a machine big enough that it mattered. Re-run the benchmark with huge_pages = on and a properly allocated huge page pool. If the regression disappears, you have your answer. If it doesn’t, send me the perf output — I’m curious.
The real lesson
The headline about Linux 7.0 halving Postgres throughput is a particular kind of benchmark artifact. A benchmark setup that ignores a configuration parameter everyone running 100+ GB of shared memory should have set for the last decade will, eventually, catch the platform doing something it’s been implicitly depending on. PREEMPT_NONE’s removal pulled that thread. The rest of the thread has been sitting there since the day somebody first set huge_pages = off on a production box with 100+ GB of shared memory and 96 vCPUs and hoped it would be fine.
Set huge pages. Upgrade your kernel. Move on.