PgQue shipped its v0.1 last week, and the part I want to talk about is not what the announcement leads with — managed-Postgres compatibility, no C extension, no daemon. Those are real, but they’re packaging. The part that’s worth understanding is the implementation, because PgQue is a working in-database queue whose hot path contains zero UPDATEs, zero DELETEs, zero SELECT ... FOR UPDATE SKIP LOCKED, and consequently zero dead tuples on the event tables.

It does this by being a port of PgQ, the Skype-era queue engine from circa 2007, into pure SQL and PL/pgSQL. The core algorithm is unchanged from Marko Kreen’s original. What’s interesting is that the algorithm is unfamiliar to a generation of Postgres developers who learned queueing through SKIP LOCKED, and is more elegant than they’re going to expect.

The model everyone reaches for first

If I asked you to write a Postgres queue from scratch this afternoon, you would write this:

-- Producer
INSERT INTO jobs (payload) VALUES (...);

-- Consumer
WITH claimed AS (
  SELECT id, payload FROM jobs
   WHERE state = 'pending'
   ORDER BY id
   LIMIT 100
     FOR UPDATE SKIP LOCKED
)
UPDATE jobs SET state = 'processing'
 WHERE id IN (SELECT id FROM claimed)
RETURNING id, payload;

-- After processing
DELETE FROM jobs WHERE id = ANY($1);

(I have written this more times than I can count.)

This works. It works in benchmarks, it works in dev, it works for the first month of production. Then it stops working, and the failure is always the same: a long-running transaction (a slow analytical query, an idle-in-transaction connection, a logical replication subscriber falling behind, a physical standby with hot_standby_feedback=on) holds back the global xmin. Autovacuum can no longer reclaim the dead tuples produced by the UPDATE-then-DELETE cycle. The jobs table grows. Index lookups slow down. The producer’s INSERT slows down because the indexes are full of dead entries. The consumer’s SKIP LOCKED scan slows down because it has to walk past more invisible tuples to find live work. Throughput drops. The backlog grows. Throughput drops more. This is what Brandur called the “queue death spiral” at Heroku in 2015 and what PlanetScale hit again in 2026 at 800 jobs/sec while running OLAP on the side. It’s not a bug. It’s the model.

The mitigations everyone reaches for are real — partition the table (Oban Pro), tune autovacuum aggressively (PGMQ), use xact_id ordering to keep the working set hot — but they are mitigations for a design choice. The design choice is: the consumer mutates the queue. Every SKIP LOCKED queue does this, because every SKIP LOCKED queue needs a way to tell the next consumer “this row is mine, do not pick it up.” Even if you do not change any data in the row, taking a row lock writes the locker’s transaction ID into the tuple header, which dirties the page and can generate write I/O even though no column changed.

What PgQ does instead

PgQ inverts the model. The consumer never mutates the event table. The consumer reads.

The producer inserts events into one of three rotating event tables (pgque.event_<queue_id>_0, _1, _2). The insert path is a plain INSERT — the only mutating operation in the queue’s hot path, and it’s the only one that’s required to be a mutation in any database queue ever built. Each event row carries an ev_txid xid8 — the transaction ID that inserted it — set by pg_current_xact_id() as a column default.
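A minimal sketch of what one of those event tables and its insert path might look like — the column names follow upstream PgQ’s conventions, and the sequence name is my guess, not PgQue’s actual DDL:

```sql
-- Hypothetical shape of one rotating event table (queue 1, slot 0).
create table pgque.event_1_0 (
    ev_id   bigint      not null,
    ev_time timestamptz not null default now(),
    ev_txid xid8        not null default pg_current_xact_id(),
    ev_type text,
    ev_data text
);

-- The producer's entire hot path: one plain INSERT.
insert into pgque.event_1_0 (ev_id, ev_type, ev_data)
values (nextval('pgque.event_1_id_seq'), 'order.created', '{"id": 42}');
```

The ev_txid default is the load-bearing part: every row records which transaction inserted it, for free, at insert time.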

So far so unremarkable. Here’s where it gets interesting.

A separate process, the ticker, runs once per second (in PgQue, by pg_cron; in upstream PgQ, by the pgqd daemon) and inserts a row into pgque.tick:

create table pgque.tick (
    tick_queue     int4        not null,
    tick_id        bigint      not null,
    tick_time      timestamptz not null default now(),
    tick_snapshot  pg_snapshot not null default pg_current_snapshot(),
    tick_event_seq bigint      not null,
    primary key (tick_queue, tick_id)
);

The interesting column is tick_snapshot. It’s a pg_snapshot — Postgres’ first-class SQL representation of an MVCC snapshot, captured at the moment the tick row was inserted. A pg_snapshot records xmin, xmax, and the list of in-progress transaction IDs at that moment. It is, in other words, the information you’d need to answer the question “for any given xid8, was that transaction visible at this point in time?”

(If your mind is slightly blown by pg_snapshot being a type that can be included in a database row, I thought it was pretty wild too the first time I encountered it.)
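You can pull a snapshot apart from psql yourself; pg_current_snapshot() and the accessor functions below are standard PostgreSQL built-ins (13 and later):

```sql
-- Capture the current snapshot and inspect its parts.
select pg_current_snapshot();                    -- text form is xmin:xmax:xip, e.g. '748:752:749,751'
select pg_snapshot_xmin(pg_current_snapshot());  -- oldest transaction still in progress
select pg_snapshot_xmax(pg_current_snapshot());  -- one past the newest assigned xid
select pg_snapshot_xip(pg_current_snapshot());   -- the in-progress xids, one row each
```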

A batch is a pair of consecutive ticks. When a consumer asks for the next batch, the system pairs (last_tick, next_tick) and hands back a batch ID. The events in that batch are defined as: every event whose inserting transaction was not visible in last_tick.tick_snapshot and was visible in next_tick.tick_snapshot. That set is computable, deterministically, by anyone who can see those two snapshot values, using pg_visible_in_snapshot(xid8, pg_snapshot) — a function available since PostgreSQL 13 that does exactly what its name claims.

The query the batch executes is, conceptually:

SELECT ev_id, ev_time, ev_txid, ev_type, ev_data, ...
  FROM pgque.event_<queue>_<n> ev,
       pgque.tick last,
       pgque.tick cur
 WHERE last.tick_id = $sub_last_tick
   AND cur.tick_id = $sub_next_tick
   AND pg_visible_in_snapshot(ev.ev_txid, cur.tick_snapshot)
   AND NOT pg_visible_in_snapshot(ev.ev_txid, last.tick_snapshot)
 ORDER BY 1;

That’s it. That’s the consumer hot path. It’s a SELECT. The consumer takes no locks on event rows, mutates no event rows, and has no opinion about whether any other consumer has seen the same events. Two different consumers asking for “everything new since their respective last ticks” each get their own batch over the same shared event log, with no contention and no row-level coordination. Fan-out is free, because nobody owns anything.

ack updates the consumer’s cursor (pgque.subscription.sub_last_tick) to the just-finished tick. That’s a single-row UPDATE on a tiny metadata table — not on the event table.
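Sketched out, with the table and column names the article gives (the parameter names are mine), the ack path is just:

```sql
-- finish_batch: advance this consumer's cursor to the tick that closed the batch.
-- This is the queue's only steady-state UPDATE, and it never touches an event table.
update pgque.subscription
   set sub_last_tick = $next_tick_id
 where sub_queue    = $queue_id
   and sub_consumer = $consumer_id;
```

One hot row per consumer, on a table whose size is the number of subscriptions, not the number of events.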

The optimization that makes it fast

The naive snapshot-diff query has a predictable problem: a long-running transaction that started before last_tick will still appear in the in-progress list (pg_snapshot_xip()) of every subsequent snapshot until it commits or aborts. That pins xmin of every subsequent snapshot at the long transaction’s xid, so NOT pg_visible_in_snapshot(ev.ev_txid, last.tick_snapshot) degrades into a scan of every event whose ev_txid >= that pinned xmin. As snapshots advance, that scan grows.

pgque.batch_event_sql() solves this with two layered tricks straight from Kreen’s original.

The first is to constrain the range scan to [xmax(last), xmax(cur)] — that is, scan only transaction IDs that completed between the two snapshots, not from the long transaction’s xid forward. This bounds the work per batch to “transactions that actually finished in this tick window,” regardless of how long the long-running transaction has been running.

The second deals with the leftover: transactions that were in pg_snapshot_xip(last) (still active at the previous tick) but are no longer in pg_snapshot_xip(cur) (have since committed). These are exactly the transactions whose events are newly visible to this batch but live below xmax(last). PgQ enumerates them with a left join:

select id1
  from pg_snapshot_xip(last_snapshot) id1
  left join pg_snapshot_xip(cur_snapshot) id2 on id1 = id2
 where id2 is null;

…and inlines them as an IN (...) list UNION ALL’d to the range scan. There’s even a small refinement that absorbs IDs near xmax(last) back into the range to keep the IN list short.
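Put together, the optimized predicate looks conceptually like this — a sketch of the shape, not PgQue’s actual generated SQL, with $-placeholders standing in for values the batch function inlines:

```sql
-- Range scan: transactions assigned between the two snapshots' xmax values...
select ev_id, ev_txid, ev_type, ev_data
  from pgque.event_1_0
 where ev_txid >= $xmax_last and ev_txid < $xmax_cur
   and pg_visible_in_snapshot(ev_txid, $cur_snapshot)
union all
-- ...plus stragglers: in progress at the last tick, committed by the current one.
select ev_id, ev_txid, ev_type, ev_data
  from pgque.event_1_0
 where ev_txid in ($straggler_xid_list)
 order by ev_id;
```

The visibility check is still needed inside the range, because some xids between the two xmax values may themselves still be in progress at the current tick.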

The result is that the batch query, on a healthy queue, scans O(events_in_window) regardless of long-transaction pressure. The query plan is a B-tree range scan on ev_txid plus a tiny IN lookup. No LIMIT ... FOR UPDATE, no advisory lock, no pg_try_advisory_lock per row — none of the apparatus a SKIP LOCKED queue needs to keep consumers from stepping on each other.

Why TRUNCATE rotation matters here

If the event tables only ever grow, you eventually have to delete from them, and we are back where we started. PgQ’s solution: keep three tables per queue and rotate them like a circular buffer. The producer always writes to the current table (queue_cur_table). Every queue_rotation_period (default two hours), pgque.maint_rotate_tables_step1() advances the pointer, then takes LOCK TABLE ... NOWAIT and TRUNCATEs the next-up table. TRUNCATE does not produce dead tuples; it unlinks the relation file. The table is reset, not garbage collected.
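A sketch of the rotation step, using the catalog column names the article mentions — the table slot chosen for truncation and the exact bookkeeping are simplified guesses at how such a rotation could be written:

```sql
-- Advance the current-table pointer for this queue (modulo the table count).
update pgque.queue
   set queue_cur_table = (queue_cur_table + 1) % queue_ntables
 where queue_id = $queue_id;

-- Reset the next-up slot. NOWAIT: if anyone still holds a lock, fail fast
-- and let the next maintenance run retry instead of blocking.
lock table pgque.event_1_2 in access exclusive mode nowait;
truncate pgque.event_1_2;
```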

The interesting bit is the safety check before the truncate. The system finds the slowest consumer’s sub_last_tick, looks up that tick’s pg_snapshot_xmin, and compares it to queue_switch_step2 (the txid at which the previous rotation became visible). If the slowest consumer might still need to read events from the table that’s about to be truncated, the rotation is skipped this round and retried later. So a stuck consumer doesn’t lose events — it pins a table from being recycled, the same way a long transaction pins old MVCC tuples. The difference is that the pinning here is at table granularity, not row granularity, and the cost is “we can’t release this table yet” rather than “every operation on the queue gets slower.”

The choice of three tables (queue_ntables = 3) is the smallest number that gives you a current table, a previous table that consumers might still be reading, and an empty target for the next TRUNCATE. Two would work in the common case but break if you want to keep the previous table available for the full rotation period rather than only until the next rotation.

Where it costs you

The trade-off, which the project is honest about, is end-to-end delivery latency. A producer’s INSERT and a consumer’s next_batch are both sub-millisecond, but a freshly inserted event is invisible to the consumer until the next tick — the snapshot it was committed in has to be the current snapshot, not the last snapshot, of some batch. With the default 1-second pg_cron ticker, end-to-end delivery typically lands in the 1–2 second range. You can drive that down with staggered cron jobs or an in-tick sleep loop, but not below the per-tick cost of writing the tick row, which is itself a small INSERT plus an UPDATE of the per-queue sequence. PgQue also inherits PgQ’s bloat profile on the small tick and subscription tables — bounded, but real under sustained held-xmin conditions. The event tables remain pristine.
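For reference, second-granularity schedules are real pg_cron syntax (pg_cron 1.5 and later); pgque.ticker() is my assumed name for the tick entry point, not a documented PgQue function:

```sql
-- Run the ticker every second via pg_cron's sub-minute scheduling.
select cron.schedule('pgque-ticker', '1 seconds', $$select pgque.ticker()$$);
```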

This is the right trade-off for an event log — a stream that fans out to multiple consumers. It is the wrong trade-off for sub-3ms job dispatch with strict per-job priority, where the answer is still a SKIP LOCKED queue (with xmin discipline and a partitioning plan), or something outside the database entirely. PgQue is closer to a Kafka topic in disguise than to RabbitMQ, and the README is correct to position it that way.

Now that we know that, what do we know?

Even if you never install PgQue, the algorithm is worth carrying around. pg_visible_in_snapshot() and the pg_snapshot type are both PostgreSQL built-ins, neither of them obscure, and the trick of “freeze a snapshot, diff against the next one, recover the set of transactions that committed in between” generalizes well beyond queues. Any time you find yourself reaching for a processed flag on a table because you need an idempotent “have I dealt with this row” indicator across multiple readers, the snapshot-pair diff is worth at least considering as an alternative. It avoids the mutation entirely. Marko Kreen figured this out twenty years ago and the rest of us mostly forgot.

PgQue’s contribution is not the algorithm. It’s that the algorithm now installs from a single SQL file on RDS, Aurora, Cloud SQL, AlloyDB, Supabase, and Neon, with no extension and no daemon. That’s a real piece of work. But the algorithm is the thing.