28 December 2016
On a PostgreSQL primary / secondary pair, it’s very important to monitor replication lag. Increasing replication lag is often the first sign of trouble, such as a network issue, the secondary disconnecting for some reason (or for no reason at all, which does happen rarely), disk space issues, etc.
You can find all kinds of complex scripts that do math on the various WAL positions that are available from the secondary and from pgstatreplication on the primary.
Or you can do this. It’s very cheap and cheerful, and for many installations, it gets the job done.
First, on the primary (and thus on the secondary), we create a one-column table:
CREATE TABLE replication_monitor ( last_timestamp TIMESTAMPTZ );
Then, we insert a singel row into the table (you can probably already see where this is going):
INSERT INTO replication_monitor VALUES(now());
Having that, we can start a cron job that runs every minute, updating that value:
* * * * * /usr/bin/psql -U postgres -c "update replication_monitor set last_update=now()" postgres > /dev/null
On the secondary (which is kept in sync with the primary via NTP, so make sure ntpd is running on both!), we have a script, also run from cron, that complains if the value in has fallen more than a certain amount behind
now(). Here’s a (pretty basic) Python 2 version:
#!/usr/bin/python import sys import psycopg2 conn = psycopg2.connect("dbname=postgres user=postgres") cur = conn.cursor() cur.execute("select (now()-last_update)>'5 minutes'::interval from replication_monitor") problem = cur.fetchone() if problem: print >>sys.stderr, "replication lag over 5 minutes."
We make sure we get the output from
stderr for cron jobs on the secondary, set it up to run every so often, and we’re done!
Of course, this has its limitations:
It only has ±2 minutes of resolution, based on how often cron runs. For a basic “is replication working?” check, this is probably fine.
It creates traffic in the replication stream and WAL, but if you are really worried about a TIMESTAMPTZ’s worth of update once a minute, you probably have other things to worry about.
It has the advantage that it works even if the server is otherwise not taking traffic, since it creates traffic all by itself.
As a final note, check_postgres has this check integrated as one of the (many, many) checks it can do, as the
replicate_row check. If you are using check_postgres, by all means use that one!