The previous post on the Linux 7.0 pgbench regression ended with the same instruction every other Postgres performance post ends with: set huge pages. This post is the long version. If you have read the Postgres docs on huge_pages and you’re still not completely sure what /proc/meminfo is telling you, what the relationship is between vm.nr_hugepages and Transparent Huge Pages, or why huge_pages = try is the wrong choice, this is for you.
There’s not a lot of new ground in here. What there is, is ground people keep tripping over despite it being covered in the documentation.
What huge pages are, in one paragraph
The CPU translates virtual addresses to physical addresses through a cache called the TLB (Translation Lookaside Buffer). Each TLB entry covers one page of memory. Default x86_64 and arm64 pages are 4 KB. The TLB is small — on modern Intel/AMD/Graviton cores, on the order of 1,500 to 4,000 entries across levels. 1,500 entries × 4 KB = 6 MB of memory covered. The first Postgres buffer you touch outside that 6 MB causes a TLB miss, which costs tens of cycles and may walk the page table. Huge pages replace 4 KB pages with 2 MB or 1 GB pages, multiplying TLB coverage by 512x or 262,144x respectively. On a 100+ GB shared_buffers, this isn’t a nice-to-have; the alternative is a TLB that can’t cache more than a rounding error of your buffer pool.
That’s the entire pitch. Everything else in this post is operational.
Explicit HugeTLB vs. Transparent Huge Pages
This is the distinction almost nobody outside kernel circles keeps straight, and the one that makes Postgres threads on the subject confusing.
Explicit HugeTLB is the original mechanism. You tell the kernel at boot or at runtime: reserve N huge pages. Those pages are taken out of the normal memory pool, held separately, and handed out only when something explicitly asks for them via MAP_HUGETLB (or hugetlbfs). If nothing uses them, they sit idle. If you ask for more than were reserved, you get an error. This is what the Postgres huge_pages GUC drives.
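The explicit pools are visible in sysfs, one directory per page size the hardware supports (the 1 GB directory only exists if the CPU supports 1 GB pages):

ls /sys/kernel/mm/hugepages/
# hugepages-1048576kB  hugepages-2048kB
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# 0 until you reserve something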
Transparent Huge Pages (THP) is a kernel feature that opportunistically coalesces 4 KB pages into 2 MB pages in the background, on memory the application never told it was special. No reservation, no explicit request. A userspace process that does not know THP exists may end up running on top of it anyway. The kernel thread khugepaged does this coalescing work asynchronously.
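You can see a host's THP mode without changing anything; the bracketed entry is the active setting:

cat /sys/kernel/mm/transparent_hugepage/enabled
# always [madvise] never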
They are not the same feature, they are not interchangeable, and they have very different operational characteristics. The Postgres community recommendation has historically been: use explicit HugeTLB via huge_pages = on, disable or de-fang THP. That recommendation is slightly out of date now — more on that below — but the underlying distinction still matters.
Sizing the reservation
PostgreSQL 15 and later give you a direct answer:
postgres -D $PGDATA -C shared_memory_size_in_huge_pages
That’s it. Run it with the server stopped (runtime-computed parameters like this one can’t be queried with -C against a live cluster); it returns the number of huge pages Postgres wants at the current huge_page_size (default is the system default, usually 2 MB). Round up slightly if you run other shared-memory consumers on the same box; the number reflects Postgres’s needs, not the kernel’s total reservation.
A sanity check: shared_buffers = 96 GB at 2 MB huge pages is 49,152 huge pages. At 1 GB pages, 96. Some slop for the small shared-memory segments (notifications, stats, etc.) — call it 50,000 or 100 respectively. shared_memory_size_in_huge_pages does the math correctly; use it.
Reserve via sysctl:
# Runtime (may fail if memory is too fragmented):
sysctl -w vm.nr_hugepages=50000

# Persistent:
echo 'vm.nr_hugepages = 50000' > /etc/sysctl.d/10-postgres-hugepages.conf
sysctl --system
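Or glue the sizing step to the reservation step. A sketch, assuming the server is stopped and $PGDATA is set; the 64-page cushion is an arbitrary number, size it for whatever else on the box uses huge pages:

pages=$(postgres -D "$PGDATA" -C shared_memory_size_in_huge_pages)
sysctl -w vm.nr_hugepages=$((pages + 64))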
On a machine where you care enough about this to be reading this post, reserve at boot. Runtime allocation on a fragmented system can partially fail: you ask for 50,000 pages and the kernel finds contiguous physical memory for 47,200. If you plan to run with huge_pages = on and you don’t have enough, Postgres will refuse to start. If you’re on try, Postgres will silently start with 4 KB pages and you’ll wonder why performance went sideways three months later.
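The kernel treats the sysctl as a request, not a contract: it allocates what it can and stores the actual count. Read it back to see whether you got everything:

sysctl -n vm.nr_hugepages
# anything below what you wrote means fragmentation won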
Boot-time reservation goes in the kernel command line:
hugepagesz=2M hugepages=50000
Or, for 1 GB pages (which must be reserved at boot on most systems — the runtime allocator usually can’t find a single gigabyte of contiguous physical memory on a long-lived host):
hugepagesz=1G hugepages=96 default_hugepagesz=1G
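On GRUB-based distros that usually means /etc/default/grub plus a regeneration step. Paths and commands vary by distro, so treat this as a sketch:

# /etc/default/grub (append to whatever GRUB_CMDLINE_LINUX already contains):
GRUB_CMDLINE_LINUX="hugepagesz=1G hugepages=96 default_hugepagesz=1G"

# Then regenerate. Debian/Ubuntu:
update-grub
# RHEL-family:
grub2-mkconfig -o /boot/grub2/grub.cfg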
Verify:
grep -E 'HugePages_|Hugepagesize' /proc/meminfo
You want HugePages_Total equal to your reservation and HugePages_Free equal to HugePages_Total before Postgres starts. After Postgres starts, HugePages_Free should drop by shared_memory_size_in_huge_pages.
The huge_pages GUC: off, try, on
Three values. One correct answer for anyone this post is addressed to.
off disables explicit huge-page use. You’re back on 4 KB pages for shared_buffers. Don’t pick this unless you are deliberately benchmarking the 4 KB case, or you’re in an environment where huge pages are structurally unavailable and there’s nothing you can do about it.
try is the default, and it is the wrong default for anyone running real shared_buffers. try asks for huge pages; on failure, it falls back to 4 KB silently and starts anyway. The point of try was to keep defaults “friendly” so a beginner installation on a small VM doesn’t fail to start. The cost is that the failure mode of a production instance where somebody changed a sysctl or rebooted the host with the wrong tuning is not “won’t start, fix it now.” It’s “running fine, half the throughput, no alert.”
on is what you want. Postgres will refuse to start if it can’t get the huge pages it asked for. That’s not a bug, that’s the feature. A refusal to start is a loud, obvious, correctable failure. A silent fallback is a quiet, subtle, uncorrectable failure that gets discovered three months later when somebody runs a profiler.
Set huge_pages = on. Not try. on.
There is exactly one environment where try is defensible: you are running an image that has to start in places you don’t control, and failing to start is materially worse than running slowly. In every other case, on is the right setting.
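One practical note: huge_pages only takes effect at server start, so setting it and reloading does nothing. Something like:

psql -c "ALTER SYSTEM SET huge_pages = 'on'"
pg_ctl -D "$PGDATA" restart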
huge_page_size: 2 MB vs 1 GB
PostgreSQL 14 added the huge_page_size GUC. On x86_64 and arm64, the choices in practice are 2 MB and 1 GB.
2 MB is the safe default. The kernel can allocate them at runtime on a reasonably unfragmented host. Most managed services that expose huge pages at all expose 2 MB. The TLB entry count reduction versus 4 KB is 512x, which is plenty.
1 GB pages require boot-time reservation and a single huge page is, by definition, not going to come from anywhere else. If you size 1 GB huge pages in /etc/default/grub kernel params and allocate 128 of them, you have permanently dedicated 128 GB of RAM to huge pages and the general kernel allocator cannot touch it. The TLB reduction versus 2 MB is 512x on top of the 512x you already got — essentially, TLB misses stop being a performance variable. For the largest instances (say, shared_buffers north of 256 GB), 1 GB pages are a measurable improvement and the operational overhead is low. For anything smaller, 2 MB pages are fine and easier.
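Wiring the two together for the 96-page 1 GB reservation above, the postgresql.conf side is two lines (huge_page_size = 0, the default, means use the system default size):

huge_pages = on
huge_page_size = 1GB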
Transparent Huge Pages: disable, enable, or don’t care
Historically, the advice was simple: disable THP for Postgres. The reason was real: early THP implementations could stall allocations while khugepaged coalesced memory, produced latency spikes under memory pressure, and interacted badly with the page cache on fork-heavy workloads. Postgres is fork-heavy. The combination was bad.
Current state of play:
- THP is still sometimes disabled by default on database-oriented distro configurations, and that’s fine.
- If you have configured explicit HugeTLB via huge_pages = on, THP’s behavior on your shared_buffers region is irrelevant: those pages come from the explicit reservation, not from anonymous-mapping coalescing.
- THP does affect other large anonymous allocations in Postgres (sort scratch, hash tables, work_mem). This is where the historical horror stories came from.
- On modern kernels (≥ 6.x) the stall behavior is much better than it was on 3.x and 4.x, and “defer” mode (defrag = defer+madvise) avoids the worst cases.
So: if you’re on explicit huge pages for shared memory — which is the posture this post recommends — then what THP is doing on the rest of Postgres’s memory is a secondary question, not a catastrophic one. Freund’s finding on the Linux 7.0 regression is relevant here: THP alone (with huge_pages = off) was sufficient to close the gap in his testing, because THP was opportunistically upgrading the shared-memory region to 2 MB pages anyway.
My recommendation hasn’t changed much:
transparent_hugepage=madvise   # kernel command line
or
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
madvise means THP is available to code that opts in (via madvise(MADV_HUGEPAGE)) and invisible to code that doesn’t. This avoids the historical problems and costs nothing. always is aggressive and occasionally still produces latency spikes on older kernels; never is overcautious on modern kernels. madvise splits the difference and stays out of the way.
If your distro’s default is already madvise or never, leave it alone. If it’s always, consider madvise. Do not lose sleep over this.
The cookbook
If you are self-hosted on a dedicated server
Compute pages with shared_memory_size_in_huge_pages. Reserve at boot via kernel command line or /etc/sysctl.d/. Set huge_pages = on. Verify with /proc/meminfo. Done.
If you are on a virtual machine you control
Same as above. Make sure the hypervisor isn’t balloon-reclaiming the huge-page region; on KVM/QEMU guests this is not usually an issue, but on oversubscribed VMware environments it occasionally is. If the hypervisor goes into memory pressure, huge pages can become unallocatable in the guest even if vm.nr_hugepages looks fine.
If you are on Kubernetes or another container orchestrator
Two separate things have to be true: the host must have huge pages reserved (via kernel command line, daemonset, or node-init script), and the pod spec must request them. The resource names are hugepages-2Mi and hugepages-1Gi:
resources:
  limits:
    memory: 128Gi
    hugepages-2Mi: 96Gi
  requests:
    memory: 128Gi
    hugepages-2Mi: 96Gi
The pod needs SYS_RESOURCE or equivalent to actually use the huge pages. Many operators (CloudNativePG, StackGres, Zalando Postgres Operator) handle this for you if you configure the right fields. Read your operator’s docs; “I set huge_pages = on and it doesn’t work in the container” nine times out of ten means the host doesn’t have them reserved or the pod didn’t request them.
If you are running Postgres in a container without host-level huge pages, and you have more than 16 GB of shared_buffers, you are accepting performance you don’t need to accept. The fix is infrastructural, not a Postgres tuning change.
If you are on RDS, Aurora, Cloud SQL, Azure Flexible Server, or similar
You can’t set this. The vendor handles it. Pick an instance class that makes sense for your workload and trust that AWS/Google/Azure are provisioning huge pages sensibly on the instance classes they sell as “memory-optimized.” They mostly are. If you suspect they aren’t, open a support ticket with a perf profile; do not try to work around it.
If you are on Supabase, Neon, Crunchy Bridge, or one of the other managed Postgres startups
Ask them. Their answers vary. Crunchy Bridge, in particular, is run by people who know the difference between THP and HugeTLB and is configured appropriately. For the others, check their documentation or ask support.
Verification
A deployment checklist I use after provisioning:
# Reservation is in place:
grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo
# HugePages_Total:   50000
# Hugepagesize:      2048 kB

# Postgres is using them:
psql -c "SHOW huge_pages"
# huge_pages | on

# The reservation actually dropped when Postgres started:
grep HugePages_Free /proc/meminfo
# HugePages_Free: <should be smaller than HugePages_Total by
#   roughly shared_memory_size_in_huge_pages>

# Postgres is not silently on 4 KB pages:
awk '/VmPeak|VmHWM|HugetlbPages/' /proc/$(pgrep -o postgres)/status
# HugetlbPages: <should be multiple GB; if 0, you are on 4 KB pages>
That last check is the one most people skip and the one most worth keeping in a monitoring script. HugetlbPages: 0 on the postmaster process means huge_pages = try fell back and nothing told you.
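A minimal sketch of that monitoring check, assuming pgrep -o (oldest matching process) finds the postmaster on your host:

hp=$(awk '/^HugetlbPages:/ {print $2}' /proc/$(pgrep -o postgres)/status)
[ "${hp:-0}" -gt 0 ] || echo "ALERT: postmaster HugetlbPages is 0; huge pages are not in use"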
When huge pages are not worth the trouble
A short list:
- shared_buffers < 2 GB. TLB reach isn’t the bottleneck at that size.
- Development/CI machines where startup failures are worse than performance loss.
- Shared-tenancy hosts where you cannot reserve memory.
Everywhere else: set them, verify them, and move on.
The operational payoff
Done correctly, huge pages are one of the cheapest, most reliable performance improvements available to a Postgres server above about 16 GB of shared_buffers. The only reason they’re not universal is that they’re a second configuration step, and second configuration steps don’t happen unless someone is paying attention.
The Linux 7.0 regression that motivated the previous post is the latest in a long series of platform-level changes that implicitly assume you’ve set huge pages, and punish you when you haven’t. There will be another one. Set them now.