The Build

01:29

Streaming replication stopped? One more thing to check.

27 October 2017

We recently were asked by a client to look at a strange replication situation. Replication had simply stopped from a primary to a streaming replication secondary. Everything looked fine:

After investigation, and some back-and-forth on mailing lists, we discovered the problem was a data ingestion procedure. At the highest level, the stored procedure created a temporary table, filled it, did some data analysis on it, and then inserted the results into a permanent table. Two changes had been made just before the replication stopped:

  1. The number of temporary tables created in each “run” of the stored procedure had been increased from 10 to 20,000.
  2. A bug had been introduced that caused an exception within the loop, but that exception had been caught and discarded.

Logically, what was happening resembled this:

DO $$
DECLARE
  i int := 1;
BEGIN
  FOR i IN 1..20000 LOOP
     BEGIN
         PERFORM f();
         i := i / 0;
     EXCEPTION
        WHEN division_by_zero THEN
     END;
  END LOOP;
END;
$$ language plpgsql;

where f() is:

CREATE OR REPLACE FUNCTION f() RETURNS VOID AS $$
BEGIN
 CREATE TEMPORARY TABLE test_table 
   ON COMMIT DROP AS SELECT i FROM generate_series(1, 100) i;
END:
$$ language plpgsql;

So, what happened?

Each time PostgreSQL creates a temporary table (or any table, for that), it takes an ExclusiveLock on it. That lock is held until the end of the transaction. In this case, each time the exception was raised, that rolled back a savepoint that PL/pgSQL implicitly creates at the start of the BEGIN/END, releasing the lock.

But only on the primary.

The secondary received the lock (one of the only types of lock a secondary can have), and held it until the entire wrapping transaction commits or rolls back. This is a lto of locks in this case. PostgreSQL handles this situation, but not entirely gracefully: the time needed up clean up all those locks when the COMMIT arrives at the secondary appears to be O(N^2), N being the number of locks.

Reducing the number of iterations (and fixing the bug) caused the issue to go away.

Interestingly, we’ve not been able to reproduce in a simple test case the lack-of-error-messages; in the test case above, all kinds of errors are emitted on the secondary.

Special thanks to Tom Lane for helping track this down.

Comments are closed.