Managed Postgres, Examined: Azure Database for PostgreSQL Flexible Server

Fifth in a series of dispassionate tours of managed PostgreSQL services. Previously: RDS, Aurora, Cloud SQL, and AlloyDB. Azure’s current general-purpose managed PostgreSQL has an HA mechanism that differs from every service covered so far, and the difference has teeth.

✦

Overview

Azure Database for PostgreSQL Flexible Server is Microsoft’s managed PostgreSQL offering. A Flexible Server instance is a community PostgreSQL process running on an Azure-managed VM, backed by Azure Premium SSD storage, with the usual managed-service apparatus: automated backups, PITR, optional HA, read replicas, server-parameter management, and networking integration.

The architectural fact that makes Flexible Server worth understanding, and the one most often missing from casual descriptions, is how it implements high availability. Flexible Server HA is synchronous PostgreSQL streaming replication at the database-process level, not block-level storage replication. Every other managed PostgreSQL in this series puts replication below the database, in the storage layer. Flexible Server puts it in the commit path of the PostgreSQL process itself. That single decision shapes most of the operational character of the product, for better and worse, and it is where most of this post lives.

One historical note and then no more: “Flexible Server” replaced the older “Single Server” offering, which Microsoft retired on March 28, 2025. Single Server is gone. If your organization is somehow still running it, you are past end of life and the migration to Flexible Server is no longer a planning exercise; it is overdue.

✦

Architecture

Compute and storage

A Flexible Server instance is a PostgreSQL process on an Azure VM, drawn from the general-purpose (D-series), memory-optimized (E-series), and burstable (B-series) VM families. Storage is Azure Premium SSD, with Premium SSD v2 available on higher-performance configurations. Storage is provisioned per-instance, can be grown without downtime, and, like every other managed service in this series, does not shrink.

A non-HA Flexible Server is a single VM with a disk. Conventional, unremarkable, and not the interesting case.

High availability: synchronous streaming replication

With HA enabled, Flexible Server provisions a standby VM running its own PostgreSQL process. The standby receives WAL from the primary over PostgreSQL’s native streaming protocol, and the replication is synchronous. Here is the part that matters: the primary does not acknowledge a commit to the client until the WAL for that commit has been persisted on both the primary and the standby. The standby flushes the WAL to durable storage before the commit returns.

What the acknowledgment does not wait for is apply on the standby. The standby stays in recovery, replaying WAL, and a commit is acknowledged once the record is durable on both nodes, not once it is visible on the standby. In PostgreSQL’s own terms this is synchronous_commit = on behavior, meaning the commit waits for the standby’s flush (there is no separate remote_flush setting; on is the level that provides it), not remote_apply. Microsoft confirms that remote_apply is not available on Flexible Server at all; you cannot set it.

This is stronger durability than remote_write, which would acknowledge once the WAL reached the standby’s OS buffers without a flush. Flexible Server waits for the flush. The durability guarantee is real: if you lose the primary, you do not lose committed transactions. You pay for that guarantee on every single commit, and the bill is the topic of half this post.
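The ladder of synchronous_commit levels described above can be sketched as an ordered enum. This is an illustration of stock PostgreSQL semantics with a synchronous standby configured, not a dump of Flexible Server's actual settings; the class and function names are made up for the example.

```python
from enum import IntEnum


class SyncCommit(IntEnum):
    """synchronous_commit settings, ordered from weakest to strongest wait."""
    OFF = 0           # return before even the local WAL flush
    LOCAL = 1         # wait for the local WAL flush only
    REMOTE_WRITE = 2  # also wait until the standby has written WAL to its OS
    ON = 3            # also wait until the standby has flushed WAL to disk
    REMOTE_APPLY = 4  # also wait until the standby has replayed the WAL


def durable_on_standby_disk(level: SyncCommit) -> bool:
    """True if an acknowledged commit is already flushed on the standby."""
    return level >= SyncCommit.ON


def visible_on_standby_at_ack(level: SyncCommit) -> bool:
    """True if the commit is guaranteed to be replayed on the standby at ack."""
    return level >= SyncCommit.REMOTE_APPLY


# The behavior this post describes for Flexible Server HA sits at ON:
# durable on both nodes, not necessarily visible on the standby yet.
level = SyncCommit.ON
print(durable_on_standby_disk(level), visible_on_standby_at_ack(level))
```

Read this way, remote_write is one rung below what Flexible Server provides and remote_apply is one rung above it.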

The exact values of synchronous_commit and synchronous_standby_names are service-managed. You cannot set them to anything that would break the HA contract, which has consequences discussed below.

Same-zone versus zone-redundant HA

The two HA modes differ in where the standby lives, and that is essentially the whole difference. The durability semantics are identical in both: synchronous, flush-on-both-nodes, no apply wait.

Same-zone HA places the standby in the same Availability Zone as the primary. The WAL round-trip is intra-AZ, sub-millisecond. It protects against node-level and rack-level failure. It does not protect against the loss of an entire AZ.

Zone-redundant HA places the standby in a different AZ. The replication is the same synchronous streaming replication; the difference is that every commit now pays an inter-AZ network round-trip, typically low single-digit milliseconds rather than sub-millisecond. It protects against AZ-level failure. It also taxes every commit accordingly.

flowchart LR
    Client[Application] -- writes --> Primary
    subgraph AZ1[Availability Zone 1]
        Primary[(Primary<br/>PostgreSQL)]
        PrimaryDisk[Premium SSD]
        Primary --> PrimaryDisk
    end
    subgraph AZ2[Availability Zone 2]
        Standby[(Standby<br/>PostgreSQL)]
        StandbyDisk[Premium SSD]
        Standby --> StandbyDisk
    end
    Primary -- WAL stream<br/>synchronous commit --> Standby
    classDef primary fill:#c3e0ff,stroke:#2a6fb3
    classDef standby fill:#f5d5d5,stroke:#a52a2a
    classDef disk fill:#ffe9b3,stroke:#b58900
    class Primary primary
    class Standby standby
    class PrimaryDisk,StandbyDisk disk

What happens when the standby is unhealthy

This is the question the brochure does not answer, and it is the most important question about this architecture.

Because the standby is in the commit path, the standby’s health is the primary’s commit latency. A slow standby disk, a standby VM caught in host-level noisy-neighbor contention, a brief inter-AZ network blip: any of these raises commit latency on the primary directly, for the duration of the problem, until Azure’s health monitoring decides the standby is unhealthy enough to break replication and let the primary proceed alone. During that window the application sees elevated commit times that have nothing to do with the primary’s own state.
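A deliberately simplified model makes the coupling concrete. It assumes the three waits happen one after another (local flush, WAL shipped and acknowledged over the network, standby flush) and ignores batching of concurrent commits; the millisecond figures are illustrative inputs, not measurements of any Azure configuration.

```python
def commit_latency_ms(primary_flush_ms: float,
                      network_rtt_ms: float,
                      standby_flush_ms: float) -> float:
    """Simplified serial model of one synchronous commit."""
    return primary_flush_ms + network_rtt_ms + standby_flush_ms


# Same primary, same network; only the standby's disk gets slow.
healthy = commit_latency_ms(0.5, 1.5, 0.5)
sick_standby = commit_latency_ms(0.5, 1.5, 40.0)
print(f"healthy: {healthy:.1f} ms, degraded standby: {sick_standby:.1f} ms")
# prints "healthy: 2.5 ms, degraded standby: 42.0 ms"
```

Nothing about the primary changed between the two calls, which is the whole point: the standby's flush time is a term in the primary's commit latency.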

This is not a bug. It is the durability guarantee of synchronous replication working exactly as designed; it refuses to lose data, even at the cost of refusing to make progress. It is also a real operational difference from the block-level-HA models (RDS Multi-AZ, Cloud SQL Enterprise HA), where replication sits below the database process and the database’s observable behavior is far less coupled to the standby’s health.

The detail that decides how bad this gets is synchronous_standby_names, and it is service-managed, so you do not get to see or set it. Whether one healthy standby acknowledging is sufficient, or the topology requires a specific standby, determines whether a single sick node can stall commits. You are trusting Microsoft’s choice here, and you cannot inspect it.

Failover

On primary failure, Azure’s health monitoring detects the loss (typically within 30 to 40 seconds, tuned to avoid false positives), breaks replication, runs recovery on the standby, promotes it, and updates the server’s FQDN to point at the new primary. End to end this is generally in the 60-to-120-second range. Existing connections drop; new connections fail until DNS propagates.
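On the client side, the practical consequence is a reconnect loop that outlasts the failover window. The sketch below is driver-agnostic: connect is any zero-argument callable you supply, retry_on defaults to OSError (swap in your driver's connection-error class), and the 180-second budget is an assumption chosen to exceed the range quoted above, not an Azure recommendation.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_wait_s: float = 180.0,
    base_delay_s: float = 0.5,
    max_delay_s: float = 10.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the time budget is spent."""
    deadline = time.monotonic() + max_wait_s
    delay = base_delay_s
    while True:
        try:
            return connect()
        except retry_on:
            if time.monotonic() + delay > deadline:
                raise  # the outage has outlasted our budget; surface the error
            sleep(random.uniform(0, delay))  # jitter avoids a reconnect stampede
            delay = min(delay * 2, max_delay_s)
```

This only helps if connect() resolves the server's hostname on every attempt; a client that caches the old address will keep dialing the failed primary no matter how patiently it retries.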

The promoted standby then runs without a peer until Azure provisions and seeds a fresh standby. For a large database that rebuild takes time, and during it the server is effectively non-HA. “HA-enabled” does not mean “continuously HA.” It means “HA except during the window right after you most recently needed it.” Plan for that window.

One genuine upside falls out of the process-level architecture: the standby has been replaying WAL continuously, so its buffer cache is warm, and failover avoids the cold-cache latency cliff that single-standby block-level failover can hit. The warmth is real but partial. The standby’s cache reflects the pages the primary has been writing, not the pages the primary’s read workload was touching, so it is warmer than a cold start without being a mirror of the primary’s working set.

Read replicas

Flexible Server supports up to five read replicas over asynchronous streaming replication, in-region or cross-region, each with its own endpoint. Cross-region replicas carry the usual cross-region lag.

Read replicas are entirely separate from the HA standby. An HA-enabled server with replicas has a primary, a synchronous standby that is not readable, and up to five asynchronous read replicas that are. A replica can be promoted to a standalone server, which breaks replication; as everywhere else, that is one-way.

Networking

Flexible Server offers two private networking models, and the choice is harder to change after provisioning than almost anything else, so make it deliberately.

VNet integration (“private access”) places the instance in a delegated subnet of your VNet with a private IP and no public endpoint. This is the typical production model.

Private Link / private endpoint exposes the instance into your VNet through a private endpoint while the compute sits in Microsoft’s infrastructure. The two models have different implications for cross-subscription access, network topology, and service-endpoint behavior. Understand the difference before you pick; do not pick whichever showed up in the first tutorial you read.

Public access is permitted with explicit IP allow-listing and is the wrong choice for production.

Built-in PgBouncer

Flexible Server ships with PgBouncer built in, enabled by server parameter and exposed on the instance on its own port (6432), alongside the regular PostgreSQL port. This is distinctive among the services in this series: RDS has the separate RDS Proxy product, and Cloud SQL and Aurora ship no built-in pooler. Enabling it gives you transaction-level pooling at the instance without standing up a separate pooling tier.

For workloads with many short-lived connections, which stock PostgreSQL handles badly without a pooler, having this available without extra infrastructure is a real convenience. Serverless-style architectures and high-connection-churn legacy applications benefit most.
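From the application's point of view, using the pooler is mostly a matter of which port you dial. A minimal sketch of building a connection URI follows, assuming the public-cloud hostname suffix and the built-in PgBouncer's port 6432; the helper name and the server, database, and user values are hypothetical, and the password is deliberately left out of the URI.

```python
from urllib.parse import quote

PGBOUNCER_PORT = 6432  # built-in PgBouncer
POSTGRES_PORT = 5432   # direct connection to PostgreSQL


def flexible_server_uri(server: str, database: str, user: str,
                        pooled: bool = True) -> str:
    """Build a libpq-style URI for a Flexible Server instance."""
    port = PGBOUNCER_PORT if pooled else POSTGRES_PORT
    host = f"{server}.postgres.database.azure.com"  # assumed public-cloud suffix
    return (f"postgresql://{quote(user, safe='')}@{host}:{port}/"
            f"{quote(database, safe='')}?sslmode=require")


print(flexible_server_uri("myserver", "appdb", "app_user"))
```

Because the pooling is transaction-level, anything that relies on session state surviving between transactions (session-level SET, advisory locks held across transactions) is better sent over the direct port with pooled=False.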

Superuser and permissions

No true SUPERUSER, same as every other managed PostgreSQL here. The azure_pg_admin role covers what most DBAs need day to day, with the usual managed-service restrictions: no filesystem access, no arbitrary C extensions, nothing that would let you violate the service contract.

✦

Features

Backups and PITR

Continuous automated backups with a configurable 7-to-35-day retention window, PITR to any point inside it, and long-term retention as a separate feature. Geo-redundant backup (a copy in the paired Azure region) is available as an option and is a regional-DR mechanism distinct from the read-replica path. Restoration provisions a new instance from the recovery point; it is not an in-place rewind.

Major version upgrades

Flexible Server supports in-place major version upgrades, with downtime on the order of minutes to tens of minutes depending on database size; for HA servers the upgrade coordinates with the standby. For a strict zero-downtime requirement, the answer is the usual one: stand up a new-version instance, replicate logically, cut over.

Version tracking is good. Azure has historically added new major versions within months of community GA, and PostgreSQL 17 is supported. This is competitive with Cloud SQL and ahead of Aurora’s and AlloyDB’s typical fork lag, because Flexible Server runs community PostgreSQL rather than a fork and has less porting work to do. Azure also now offers its own paid Extended Support for versions past community EOL, the same back-patching-for-a-fee model that has shown up across the managed-Postgres market.

Extensions

Flexible Server gatekeeps extensions with a two-step model that surprises people coming from RDS or Cloud SQL. You first add an extension to the server-level azure.extensions allow-list parameter, and only then can you CREATE EXTENSION it at the database level. Forget the first step and the CREATE EXTENSION fails with an error that does not obviously point at the cause. The allow-list covers the usual suspects: PostGIS, pg_stat_statements, pgvector, pg_cron, pg_partman, pg_trgm, and many more. Custom C extensions outside the list cannot be installed, and there is no customer-defined-extension escape hatch analogous to RDS’s pg_tle.
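A pre-flight check in a migration script can catch the forgotten first step before CREATE EXTENSION fails. The sketch below assumes you have already fetched the current value of azure.extensions (for example with SHOW azure.extensions) and that it is a comma-separated list; the case-insensitive comparison is a defensive assumption, and the function name is made up.

```python
def missing_from_allow_list(allow_list_value: str, wanted: list[str]) -> list[str]:
    """Return the wanted extensions that are not yet on the server allow-list."""
    allowed = {
        name.strip().lower()
        for name in allow_list_value.split(",")
        if name.strip()
    }
    return [ext for ext in wanted if ext.lower() not in allowed]


# pgvector installs under the extension name "vector".
current = "PG_TRGM, postgis"
print(missing_from_allow_list(current, ["pg_trgm", "vector"]))
# prints ['vector']
```

Anything the function returns needs to be added to the allow-list parameter before the migration runs its CREATE EXTENSION statements.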

Server parameters

Parameters are exposed directly through the portal, CLI, and API. The model reads like configuring postgresql.conf through a managed interface rather than RDS’s create-a-parameter-group-and-apply-it indirection. Some parameters require a restart; most do not. The set of user-modifiable parameters is smaller than stock PostgreSQL, but the ones you can change, you change directly. For DBAs used to editing postgresql.conf, this is the most comfortable parameter experience of the managed services in this series.

Authentication

Password authentication and Microsoft Entra ID (formerly Azure Active Directory). Entra ID lets database users authenticate with their Azure identities, in the same spirit as IAM auth on Cloud SQL or RDS, and it is the right choice for workloads running on Azure compute with managed identities (App Service, AKS, VMs). The integration is mature.

Near-zero-downtime maintenance

On HA-enabled instances, Microsoft performs maintenance by patching the standby, failing over to it, and patching the old primary, which collapses maintenance downtime to the few seconds of the failover itself. Non-HA instances take the conventional in-place restart. This is a good reason to run HA even on workloads whose availability requirements alone might not demand it.

Monitoring

Azure Monitor surfaces the standard metrics (CPU, memory, IO, connections, replication lag) through the usual Azure stack. Query Store (related in lineage to, but not the same as, SQL Server’s Query Store) captures query-level telemetry, including plans over time and wait-event breakdowns, accessible through SQL views. Combined with pg_stat_statements, it gives a reasonable in-platform performance-telemetry story for teams that would rather not leave Azure tooling.

Storage autogrow

Storage can grow automatically as it approaches capacity. It still does not shrink. Nothing in this series shrinks.

✦

Non-brochure concerns

The standby is in the write path, and you will eventually feel it

The single most important operational fact about Flexible Server HA: every commit on an HA-enabled primary waits for the standby. When the standby is healthy, this is invisible. When it is degraded (a slow standby disk, a brief inter-AZ hiccup, a standby host under contention), the primary’s commit latency climbs in lockstep with the standby’s trouble, with no fault on the primary itself.

The failure mode that surprises teams migrating from a block-level-HA service is a commit-latency spike on the primary with no primary-side cause. The cause is almost always the standby. Azure Monitor will happily show you the primary flatlining and the standby quietly sick; diagnosing it means looking at both sides at once.

A few things from diagnosing this in the field are worth stating plainly, because none of them are obvious from the documentation.

The blast radius scales with commit count, not commit size. A 4 KB metadata commit waiting on a degraded standby flush stalls exactly as completely as a 50 KB TOAST-heavy commit. When the standby or the inter-AZ path slows down, the variable that determines how badly you are hurt is how many transactions per second you are committing, not how big they are. This matters because the intuitive remediation (“our rows are too big, let us move the large payloads off Postgres”) does nothing for sensitivity to standby latency. If you want to reduce exposure to this failure mode, reduce commit frequency (batch small writes), not commit size.
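The arithmetic behind that claim is Little's law: the average number of commits parked at once equals the commit arrival rate times the extra wait each one incurs. A rough sketch, assuming the application keeps submitting commits at the same rate during the stall (in practice a connection pool limit caps this); the 50 ms delay is an illustrative input.

```python
def backends_parked(commits_per_s: float, standby_delay_ms: float) -> float:
    """Little's law: average commits in flight = arrival rate x wait time."""
    return commits_per_s * standby_delay_ms / 1000.0


# Same 50 ms standby slowdown, three different commit rates.
for rate in (100, 1_000, 5_000):
    print(rate, backends_parked(rate, standby_delay_ms=50))
# prints:
# 100 5.0
# 1000 50.0
# 5000 250.0
```

Commit size does not appear anywhere in the formula. Batching ten small writes into one transaction cuts the rate, and therefore the number of parked backends, by a factor of ten.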

Watch for the SyncRep wait event on the primary. When synchronous replication is on the commit path and the standby is slow, backends park in the SyncRep wait event waiting for the standby’s acknowledgment. Seeing SyncRep show up as a meaningful share of primary wait time is direct, in-telemetry confirmation that you are looking at a synchronous-replication stall and not a primary-local problem. It is the first thing to check, and most observability stacks (including a properly configured Datadog or pg_stat_activity sampling) will show it.
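If you are sampling pg_stat_activity yourself, the check reduces to counting rows. A minimal sketch: the query text relies on standard pg_stat_activity columns, SyncRep is reported under the IPC wait event type, and the helper name and the idea of feeding it pre-fetched rows are assumptions made to keep the example free of any particular driver.

```python
from typing import Iterable, Optional, Tuple

# Run this periodically on the primary and pass the rows to syncrep_share().
SAMPLE_SQL = """
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'client backend' AND state = 'active'
"""

Row = Tuple[Optional[str], Optional[str]]  # (wait_event_type, wait_event)


def syncrep_share(rows: Iterable[Row]) -> float:
    """Fraction of sampled active backends waiting on synchronous replication."""
    rows = list(rows)
    if not rows:
        return 0.0
    waiting = sum(1 for wtype, event in rows
                  if wtype == "IPC" and event == "SyncRep")
    return waiting / len(rows)


sample = [("IPC", "SyncRep"), ("IPC", "SyncRep"), (None, None), ("IO", "DataFileRead")]
print(syncrep_share(sample))  # prints 0.5
```

A share that sits near zero when healthy and jumps during a latency incident points at the standby or the path to it rather than at the primary.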

Cancelling a stuck COMMIT does not roll it back. This one is a genuine footgun. If a COMMIT is parked waiting on synchronous replication and you cancel it (statement timeout, client cancel, pg_cancel_backend), the transaction does not abort. The local WAL flush has already happened; the only thing being waited on is the standby. You get WARNING: canceling wait for synchronous replication due to user request, and the transaction returns committed, just without the guarantee that the standby has it. Application code that treats a timed-out or cancelled COMMIT as “failed, retry the whole transaction” will double-write. If you set aggressive statement timeouts on an HA Flexible Server, audit your retry logic before you find out the hard way.
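One way to make retries safe is to treat an interrupted COMMIT as "outcome unknown" and check for the write before repeating it. The sketch below shows the shape of that logic with caller-supplied callables; CommitOutcomeUnknown, write_once, and the idempotency key are all hypothetical names, and in real code already_applied would be a lookup against a uniquely keyed column rather than an in-memory list.

```python
from typing import Callable


class CommitOutcomeUnknown(Exception):
    """COMMIT was cancelled or timed out while waiting; it may have committed."""


def write_once(key: str,
               apply_write: Callable[[str], None],
               already_applied: Callable[[str], bool],
               max_attempts: int = 3) -> None:
    """Apply a write at most once, even if a COMMIT's outcome was unclear."""
    for _ in range(max_attempts):
        if already_applied(key):
            return  # an earlier attempt did commit; do not write again
        try:
            apply_write(key)
            return
        except CommitOutcomeUnknown:
            pass  # do not assume failure; the next pass re-checks first
    if already_applied(key):
        return
    raise RuntimeError(f"write {key!r} not confirmed after {max_attempts} attempts")


# Simulate a COMMIT that succeeded locally but looked like a failure.
ledger: list[str] = []

def apply_write(key: str) -> None:
    ledger.append(key)            # the transaction committed...
    raise CommitOutcomeUnknown()  # ...but the client saw an interrupted COMMIT

write_once("order-42", apply_write, lambda k: k in ledger)
print(ledger)  # prints ['order-42']
```

A naive retry loop would have appended the order twice; the check-before-retry version leaves exactly one row.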

The flip side, in fairness: this architecture gives you a durability guarantee that is easy to explain and easy to audit. Committed transactions are flushed to disk on two independent VMs in two independent failure domains. Block-level HA provides a comparable guarantee through a different mechanism, but “the WAL is durable on two PostgreSQL servers before the client hears ‘committed’” is a sentence an auditor understands without a diagram.

Zone-redundant HA’s commit latency is a real throughput tax

Choosing zone-redundant over same-zone adds the inter-AZ round-trip to every commit, on the order of 1 to 3 ms when healthy. For a workload doing thousands of small OLTP commits per second, that shows up in tail latency and in throughput ceilings, and there is no database-layer tuning that engineers it away, because it is a property of the durability contract, not a configuration accident.
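The throughput ceiling follows directly from the latency. A session that commits one transaction at a time cannot exceed one commit per commit-latency interval, so a rough upper bound looks like the sketch below. The latencies are illustrative inputs (0.5 ms for same-zone, plus 2 ms of inter-AZ round-trip for zone-redundant), not benchmarks.

```python
def max_commits_per_s(commit_latency_ms: float, sessions: int = 1) -> float:
    """Upper bound when each session commits serially, one at a time."""
    return sessions * 1000.0 / commit_latency_ms


same_zone = max_commits_per_s(0.5)
zone_redundant = max_commits_per_s(0.5 + 2.0)
print(same_zone, zone_redundant)  # prints 2000.0 400.0
```

More concurrent sessions raise the aggregate ceiling, which is why the tax shows up first in single-threaded batch jobs and in per-request tail latency rather than in headline throughput.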

The decision is a recovery-objective decision, not a performance one. Same-zone HA tolerates node and rack failure; zone-redundant tolerates the loss of an AZ. If you have a regulatory or contractual requirement for AZ-independent durability, choose zone-redundant and size for the latency. If your worst-case-acceptable scenario is “the AZ dies, we fail over to a cross-region read replica and accept the RPO,” same-zone HA buys back the commit latency.

You cannot tune the durability tradeoff yourself

On a self-managed cluster you could set synchronous_commit = local for a bulk-load window and resync afterward, or run quorum commits through Patroni with a tuned synchronous_standby_names. On Flexible Server you can do none of this; the synchronous-replication parameters are service-managed, and a uniform synchronous-commit tax applies to every write regardless of how much you would like to relax it for a specific phase of work. For most workloads that is the right default. For workloads with a genuine bulk-ingest phase, it is a constraint to know about going in.

Same-zone and zone-redundant are not trivially interchangeable

Switching HA mode reprovisions the standby. Choose at provisioning time rather than assuming you will flip it later.

The burstable tier is a production hazard

The B family looks attractive for low-traffic workloads: cheap at idle, bursts when needed. The non-obvious part is the credit model. Burstable instances accumulate CPU credits while idle and spend them under load; a workload with sustained above-baseline CPU eventually exhausts its credits and falls back to baseline, which is a fraction of the nominal vCPU. The failure mode is “performance collapses precisely when the workload gets busy,” which is the worst possible time. Fine for dev and test. Rarely the right call for production, no matter how small the instance looks at provisioning time. Burstable also does not support HA at all, which settles the question for anything that needs a standby.
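A toy token-bucket simulation shows the shape of the collapse. This is not Azure's actual credit accounting: the baseline fraction, starting balance, cap, and one-minute step are all invented for illustration, with one credit standing for one minute of a full vCPU.

```python
def simulate_credits(load, baseline=0.2, start_credits=30.0, max_credits=60.0):
    """Return the CPU fraction actually delivered for each minute of demand."""
    credits = start_credits
    delivered = []
    for demand in load:
        credits = min(max_credits, credits + baseline)  # earn at the baseline rate
        cpu = min(demand, credits)                      # cannot spend what is not banked
        credits -= cpu
        delivered.append(cpu)
    return delivered


# One hour of sustained full-CPU demand.
result = simulate_credits([1.0] * 60)
print(result[0], result[-1])  # prints 1.0 0.2
```

For the first half hour or so the instance delivers everything asked of it, then drops to the baseline fraction and stays there for as long as the demand persists.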

Azure Monitor logging has its own bill

The same observation that applied to Cloud SQL and Cloud Logging applies here: turning query logging up to a genuinely useful verbosity and routing it through Azure Monitor and Log Analytics generates ingestion volume that is billed separately from the database compute. Budget for it before you discover it on an invoice.

No true SUPERUSER

azure_pg_admin does most of what most DBAs need, with the usual restrictions. Same story as the rest of the series.

✦

Positives

Community PostgreSQL, not a fork. Flexible Server runs stock PostgreSQL. Behaviors, extensions, and internals match self-hosted PostgreSQL, and tuning knowledge transfers without an asterisk. For a team with real PostgreSQL operational depth, that transfer is the whole value proposition.

A clean, inspectable HA model. The standby is a real PostgreSQL server, the replication is PostgreSQL’s own streaming replication, and the failure modes are the ones anyone who has run streaming replication already understands. Fewer layers of vendor abstraction sit between you and what is actually happening.

Built-in PgBouncer. Instance-level transaction pooling without a separate tier. Most competitors do not offer this.

Good upstream version tracking. New major versions land within months of community GA, ahead of the forks.

Direct parameter management. Setting parameters feels like administering PostgreSQL, not like operating a parameter-group abstraction.

Microsoft Entra ID integration. A clean identity story for Azure-native workloads that removes a category of secret-handling.

Warm-cache failover. The standby has been applying WAL, so post-failover latency profiles beat cold-standby designs, with the caveat that the warm cache reflects writes, not the read working set.

Near-zero-downtime maintenance on HA instances. A real quality-of-life improvement, and an argument for enabling HA on its own merits.

Auditable synchronous durability. “Committed data is flushed on two servers in two failure domains before the client is told it committed” is a guarantee that maps directly onto a compliance requirement.

✦

Negatives

The standby is in the commit path. Under degraded standby conditions, primary commit latency rises, and you cannot tune your way out of it at the database layer. The blast radius scales with commit rate, the diagnostic tell is the SyncRep wait event, and cancelling a stuck COMMIT does not roll it back. This is the defining operational cost of the product.

Zone-redundant HA has a per-commit latency tax. Every commit pays the inter-AZ round-trip. High-commit-rate workloads feel it.

Post-failover degraded-HA window. Standby rebuild after a failover takes time proportional to database size, and you are non-HA until it finishes.

Service-managed durability parameters. No selectively relaxing synchronous_commit for a bulk-load phase, no quorum tuning.

Constraining extension allow-list. Common extensions are present; uncommon ones are not, and there is no pg_tle-style customer-defined-extension mechanism.

Burstable tier is a trap for production. Easy to choose, hard to diagnose, and incompatible with HA.

Networking model is hard to change later. VNet integration versus Private Link is a provisioning-time decision with lasting consequences.

No true SUPERUSER. Same limitation as the rest of the field.

No distinctive compute or storage architecture. No Aurora-style cloning, no AlloyDB columnar engine, no Cloud SQL data cache. Flexible Server is well-run stock PostgreSQL on a VM with a disk, and it does not pretend otherwise.

✦

Best-fit workloads and organizations

General-purpose OLTP on Azure. For applications on App Service, AKS, Functions, or Azure VMs that need standard PostgreSQL with credible HA, Flexible Server is the obvious managed choice. The engine behaves like PostgreSQL because it is PostgreSQL.

Workloads that benefit from built-in pooling. High-connection-churn applications, serverless patterns, and microservice fleets without a central pooling tier.

Entra ID-centric organizations. Where Azure identity is already the backbone, the auth integration removes real work.

Workloads that need auditable synchronous durability. When the requirement is “committed transactions live on two independent disks in two independent VMs,” process-level synchronous replication maps onto it cleanly.

Teams with strong generic PostgreSQL expertise who want that expertise to transfer without qualification.

Poor fits

Latency-sensitive high-commit-rate write workloads where inter-AZ synchronous replication is unacceptable. Same-zone HA is the escape valve; zone-redundant adds commit latency you cannot tune away.

Workloads needing PostgreSQL features outside the allow-list. Custom C extensions, certain replication-slot-dependent workflows, anything wanting true SUPERUSER.

Workloads wanting exotic storage or columnar acceleration. Flexible Server is compute-plus-disk. The distinctive-storage benefits of Aurora or AlloyDB are not here.

Workloads with hard scale-to-zero requirements. Burstable tiers exist but carry the credit-exhaustion caveat, and true scale-to-zero is not the model.

✦

Next in this series: Azure Cosmos DB for PostgreSQL, the distributed-PostgreSQL offering with Citus underneath, which targets a workload that looks nothing like Flexible Server.