Streaming replication stopped? One more thing to check.
01:29, 27 October 2017
We were recently asked by a client to look at a strange replication situation. Replication had simply stopped from a primary to a streaming replication secondary. Everything looked fine:
- The secondary was connected to the primary, as we could see in pg_stat_replication.
- The secondary was not receiving any queries, so delays in applying incoming page changes weren’t an issue.
- The primary was receiving a modest number of changes, so it wasn’t that there were no changes at all, or that the volume was so high that the secondary was struggling to keep up.
- There were no errors in the logs.
- The only symptom was that replay_location in pg_stat_replication wasn’t advancing, even as the changes were being sent down to the secondary.
- The only anomaly that we could find was that, on the secondary, there were a huge number of entries in pg_locks for the PostgreSQL startup process (over 10,000!), all ExclusiveLocks on relations that didn’t actually exist.
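For context, this is roughly what we were looking at (a sketch only; sent_location and replay_location are the pre-PostgreSQL-10 column names, and 12345 is a placeholder for the startup process PID, which you’d look up yourself with ps or from the logs):

-- On the primary: is the secondary still connected, and is replay advancing?
SELECT application_name, state, sent_location, replay_location
FROM pg_stat_replication;

-- On the secondary: what locks is the startup process holding?
-- (12345 is a placeholder for the startup process PID)
SELECT locktype, mode, count(*)
FROM pg_locks
WHERE pid = 12345
GROUP BY locktype, mode;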
After investigation, and some back-and-forth on mailing lists, we discovered the problem was a data ingestion procedure. At the highest level, the stored procedure created a temporary table, filled it, did some data analysis on it, and then inserted the results into a permanent table. Two changes had been made just before the replication stopped:
- The number of temporary tables created in each “run” of the stored procedure had been increased from 10 to 20,000.
- A bug had been introduced that caused an exception within the loop, but that exception had been caught and discarded.
Logically, what was happening resembled this:
DO $$
DECLARE
    i int := 1;
BEGIN
    FOR i IN 1..20000 LOOP
        BEGIN
            PERFORM f();
            i := i / 0;   -- the introduced bug: always raises division_by_zero
        EXCEPTION
            WHEN division_by_zero THEN
                NULL;     -- the exception is caught and discarded
        END;
    END LOOP;
END;
$$ language plpgsql;
where f() is:
CREATE OR REPLACE FUNCTION f() RETURNS VOID AS $$
BEGIN
    CREATE TEMPORARY TABLE test_table
        ON COMMIT DROP AS SELECT i FROM generate_series(1, 100) i;
END;
$$ language plpgsql;
So, what happened?
Each time PostgreSQL creates a temporary table (or any table, for that matter), it takes an ExclusiveLock on it. That lock is held until the end of the transaction. In this case, each time the exception was raised, it rolled back the savepoint that PL/pgSQL implicitly creates at the start of the BEGIN/END block, releasing the lock.
But only on the primary.
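You can see the primary-side behavior in an ordinary psql session by using an explicit savepoint, which is what the PL/pgSQL block does implicitly (a minimal sketch; the table and savepoint names are made up, and pg_locks reports the mode as AccessExclusiveLock):

BEGIN;
SAVEPOINT before_create;
CREATE TEMPORARY TABLE demo_table ON COMMIT DROP AS
    SELECT i FROM generate_series(1, 100) i;

-- The lock on the new table is visible to this backend:
SELECT relation::regclass, mode
FROM pg_locks
WHERE pid = pg_backend_pid() AND mode = 'AccessExclusiveLock';

ROLLBACK TO SAVEPOINT before_create;

-- ...and it is gone once the savepoint is rolled back:
SELECT relation::regclass, mode
FROM pg_locks
WHERE pid = pg_backend_pid() AND mode = 'AccessExclusiveLock';

COMMIT;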
The secondary received the lock (one of the only types of lock a secondary can have), and held it until the entire wrapping transaction committed or rolled back. That’s a lot of locks in this case. PostgreSQL handles this situation, but not entirely gracefully: the time needed to clean up all those locks when the COMMIT arrives at the secondary appears to be O(N^2), N being the number of locks.
Reducing the number of iterations (and fixing the bug) caused the issue to go away.
Interestingly, we’ve not been able to reproduce the lack of error messages in a simple test case; in the test case above, all kinds of errors are emitted on the secondary.
Special thanks to Tom Lane for helping track this down.