postgresql when it's not your job

20:24

A Cheap and Cheerful Replication Check

28 December 2016

On a PostgreSQL primary / secondary pair, it’s very important to monitor replication lag. Increasing replication lag is often the first sign of trouble, such as a network issue, the secondary disconnecting for some reason (or for no reason at all, which does happen rarely), disk space issues, etc.

You can find all kinds of complex scripts that do math on the various WAL positions that are available from the secondary and from pgstatreplication on the primary.

Or you can do this. It’s very cheap and cheerful, and for many installations, it gets the job done.

First, on the primary (and thus on the secondary), we create a one-column table:

CREATE TABLE replication_monitor (
   last_timestamp TIMESTAMPTZ
);

Then, we insert a singel row into the table (you can probably already see where this is going):

INSERT INTO replication_monitor VALUES(now());

Having that, we can start a cron job that runs every minute, updating that value:

* * * * * /usr/bin/psql -U postgres -c "update replication_monitor set last_update=now()" postgres > /dev/null

On the secondary (which is kept in sync with the primary via NTP, so make sure ntpd is running on both!), we have a script, also run from cron, that complains if the value in has fallen more than a certain amount behind now(). Here’s a (pretty basic) Python 2 version:

#!/usr/bin/python

import sys

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")

cur = conn.cursor()

cur.execute("select (now()-last_update)>'5 minutes'::interval from replication_monitor")

problem = cur.fetchone()[0]

if problem:
    print >>sys.stderr, "replication lag over 5 minutes."

We make sure we get the output from stderr for cron jobs on the secondary, set it up to run every so often, and we’re done!

Of course, this has its limitations:

It has the advantage that it works even if the server is otherwise not taking traffic, since it creates traffic all by itself.

As a final note, check_postgres has this check integrated as one of the (many, many) checks it can do, as the replicate_row check. If you are using check_postgres, by all means use that one!

Vladimir at 01:54, 29 December 2016:

If you need better resolution, you can use https://github.com/dev1ant/repl_mon bgwroker for that. Some details could be found in https://simply.name/postgresql-replication-monitoring.html.