postgresql when it's not your job

28 August 2018

17:00

Don’t LOCK tables. Just don’t.

It’s not uncommon that an application needs to serialize access to one or more resources. The temptation is very high to use the LOCK TABLE SQL statement to do this.

Resist the temptation.

There are many issues with using LOCK:

It blocks autovacuum, which can cause bloat or even transaction ID wraparound in extreme cases.
An ACCESS EXCLUSIVE lock (the default mode) is passed down to secondaries, which can block queries there, or even cause deadlock-type situations.
It’s easy to cause deadlocks with bad LOCKing order.

If the goal is to serialize access, consider using advisory locks instead. They have all of the benefits of a LOCK on a table, while not actually blocking access to autovacuum, or access on secondaries.

(Yes, some database tools may need to take explicit locks for a variety of reasons; that’s a different matter, of course.)

no comments

9 August 2018

10:06

Three Steps to pg_rewind Happiness

pg_rewind is a utility included with PostgreSQL since 9.x. It’s used to “rewind” a server so that it can be attached as a secondary to a primary. The server being rewound could be the former primary of the new primary, or a secondary that was a peer of the new primary.

In pg_rewind terminology, and in this post, the “source” server is the new primary that the old server is going to be attached to, and the “target” is the server that will be attached to the source as a secondary.

Step One: Have a WAL Archive

While pg_rewind does not require that you have a WAL archive, you should have one. pg_rewind works by “backing up” the target server to a state before the last shared checkpoint of the two servers. Then, when the target starts up, it uses WAL information to replay itself to the appropriate point at which it can connect as a streaming replica to the source. To do that, it needs the WAL information from the rewind point onwards. Since the source had no reason to “know” that it would be used as a primary, it may not have enough WAL information in its pgxlog / pgwal directory to bring the target up to date. If it doesn’t, you are back to rebuilding the new secondary, the exact situation that pg_rewind is meant to avoid.

Thus, make sure you have a WAL archive that the target can consult as it is coming up.

Step Two: Properly Promote the Source Server

The source server, which will be the new primary, needs to be properly promoted. Use the pg_ctl promote option, or the trigger_file option in recovery.conf so that the source promotes itself, and starts a new timeline. Don’t just shut the source down, remove recovery.conf, and bring it back up! That doesn’t create a new timeline, and the source won’t have the appropriate divergence point from the target for pg_rewind to consult.

Step Three: Wait for the Forced Checkpoint to Complete

When a secondary is promoted to being a primary, it starts a forced checkpoint when it exits recovery mode. This checkpoint is a “fast” checkpoint, but it can still take a while, depending on how big shared_buffers is and how many buffers are dirty. Use tail -f to monitor the logs on the source and wait for that forced checkpoint to complete before running pg_rewind to rewind the target. Failing to do this can cause the target to be corrupted. If you are writing a script to do this, issue a CHECKPOINT statement to the source before running pg_rewind.

And have fun rewinding servers!

no comments

7 August 2018

17:58

Does anyone really know what time it is?

PostgreSQL has a variety of ways of telling time: now(), statement_timestamp(), and clock_timestamp(). Each has a different sense of when “now” is:

now() is the time at the start of the transaction; it never changes while the current transaction is open.
statement_timestamp() is the time that the current statement started running. It changes from statement to statement, but is constant within a statement (which means it is constant within a PL/pgSQL function).
clock_timestamp() changes each time it is called, regardless of context.

Each has its uses:

For a predicate in a WHERE clause, you want either now() or statement_timestamp(). These work properly with indexes, because they are constant within the execution of a statement.
If you need the time to update within a single transaction, use statement_timestamp(); otherwise, use now().
Generally, you only use clock_timestamp() inside of a programming language procedure so you can get the current timestamp.

2 comments

23 July 2018

09:00

One of those things: pgbouncer and self-signed certificates

(This is another intermittent series of small things that are easy to forget, and cause irritation in the PostgreSQL world…)

Since version 1.7, pgbouncer has had built-in TLS support. This is very handy, since the previous versions required stunnel or something similar to provide TLS.

Frequently, either for test purposes or because you’re in an environment where checking client certificates isn’t required, pgbouncer is set up using a self-signed client certificate. It’s easy to forget that you need to set the certificate authority parameter to point to the certificate file in this case, but you do:

client_tls_sslmode = allow
client_tls_key_file = /etc/pgbouncer/pgbouncer.key
client_tls_cert_file = /etc/pgbouncer/pgbouncer.pem
client_tls_ca_file = /etc/pgbouncer/pgbouncer.pem  ; don't forget this one!

Note that it’s generally not a great idea to use a self-signed certificate in production, since you are vulnerable to man-in-the-middle attacks in that case.

no comments

18 July 2018

09:00

pg_rewind and checkpoints: caution!

pg_rewind, introduced in PostgreSQL 9.5, is a powerful utility solving a particular problem: If you have a promoted a streaming replication secondary into being a primary, how can you make sure that the former primary, and any other secondaries that used to be connected to it, are able to connect to the new primary? Previously, there was no entirely safe way of doing so without simply rebuilding the secondaries using pg_basebackup (for example), which could take a very long time on large databases.

pg_rewind works by connecting to (or having direct file-system level access to) the new primary, and uses the WAL information to “back up” the target system (the old master or old peer secondaries) to the point that they can reattach as secondaries to the new primary primary.

It works wonderfully… but like anything powerful, it has some warnings associated with it. Here’s one of them.

In a recent situation, a client was doing a flip-back-and-forth stress test, in which a secondary would be promoted, its former primary rewound, the two reattached, and then back again. This worked well for many iterations, but after one particularly fast iteration, the new secondary (former primary, now rewound) wouldn’t come back up; the message was:

requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103

What happened?

When PostgreSQL exits recovery, it starts a forced checkpoint; you can see this in the logs with a message:

checkpoint starting: force
database system is ready to accept connections

Note that the relevant text here is that the checkpoint is starting; it hasn’t completed yet.

The hypothesis is (not 100% confirmed, but seems likely) that the pg_rewind on the new secondary was done before the checkpoint had finished. Since the one of last things a checkpoint does is write the pg_control file, and one of the first things that pg_rewind does is read the control file from the source machine, there’s a window where the pg_control file on disk will be out of date. The result is a rather confused situation on the new secondary.

Thus, if you are going to do a pg_rewind, make sure the initial checkpoint on the source machine has completed before doing the rewind. In a scripted environment, this can be tricky, since the initial checkpoint can take any amount of time depending on how much has to be flushed out to disk. One option is to issue a CHECKPOINT statement to the newly-promoted primary to make sure a checkpoint has been completed; this does a fast checkpoint, and doesn’t return until the checkpoint is completed. (Edited to recommend CHECKPOINT.)

(Thanks to Michael Paquier and Andres Freund for pointing me in the right direction here.)

3 comments

17 July 2018

10:38

That Google Checksum Tool

Google recently released a tool to check on-disk checksums in PostgreSQL. PostgreSQL being hot, and Google being Google, this generated a lot of press, much of it containing an element of FUD about the integrity of PostgreSQL backups (note that Google’s own announcements were quite straight-forward about it).

First, some background: Since PostgreSQL 9.3, it has had page checksums available as an option when initializing a new database. These are fast, lightweight checksums to verify that the data on the page hasn’t been corrupted. It’s an excellent and low-impact data integrity feature, and highly recommended.

However, PostgreSQL doesn’t just randomly check the checksums of pages it is not going to read into memory (and why would it?). Thus, there can be lurking corruption in database that you don’t discover until too late… specifically, after all of the backups also contain it. Some PostgreSQL page-level backup tools, such as pgBackRest, check the checksums during the backup, but other tools (such as WAL-E and WAL-G) do not.

Google’s tool adds to that collection by doing an on-disk scan of each page, and verifying the checksum. That’s all it does. That’s not useless! It’s relevant to backups because doing this scan on a backup image doesn’t create the large I/O burden that doing so on an active server would.

So, if you are using a backup tool that does not verify backups, and that creates a disk image (as opposed to say, pg_dump) by all means consider this new addition as part of your toolbox. But there’s nothing inherent about PostgreSQL backups that makes them more prone to corruption, and some of the unfortunate press around this has given that implication.

1 comment

16 July 2018

20:33

One of those things: /tmp/pgpool_status

(This is another intermittent series of small things that are easy to forget, and cause irritation in the PostgreSQL world…)

When setting up pgpool2, it’s common to tweak the configuration file repeatedly… and often get a hostname wrong. One common occurrence is then:

You fix the hostnames.
You restart pgpool2.
It refuses to connect to the host you just fixed.
You scratch your head, connect manually with psql, everything works…
But pgpool2 just will not acknowledge that host.

Often, the problem is that the pgpool_status file, usually kept in /tmp, has cached the status of one of the hosts. The pgpool_status file is retained across pgpool2 restarts. Deleting the pgpool_status file will fix the issue.

(Why, yes, this just happened to me today…)

no comments

10 May 2018

12:00

PostgreSQL Replication at PerconaLive 2018

The slides from my talk on PostgreSQL Replication at PerconaLive 2018 are available.

no comments

10 April 2018

18:36

Change these: wal_level and wal_hint_bits

(An intermittent series on PostgreSQL parameters whose default settings you should change.)

wal_level = logical

wal_level controls how much information is written into the write-ahead log. Higher settings write more information, enabling more features, at the expense of (somewhat) larger write-ahead log volume.

Just set it to logical, the highest setting. The incremental volume is not that significant, and you are then ready to support logical replication without a restart in case you feel the need.

(If you are running 9.3 or earlier, the highest wal_level setting is hot_standby, but you should have upgraded by now.)

wal_log_hints = on # if you do not have data page checksums enabled.

This setting enables the logging of certain hint-bit changes. Without going into gruesome detail, this is required for certain tools (in particular, pg_rewind) to operate properly. Just turn it on.

If the cluster was initialized with data page checksums, you’re getting hint bit logging anyway, so there’s no need to turn it on (although it doesn’t hurt to do so; it’s ignored on clusters with data page checksums).

no comments

6 April 2018

08:00

Change these: Logging

(An intermittent series on PostgreSQL parameters whose default settings you should change.)

Here’s my preferred logging configuration:

log_destination = 'csvlog'
logging_collector = on
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y%m%d-%H%M%S.log'
log_rotation_size = 1GB
log_rotation_age = 1d
log_min_duration_statement = 250ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_statement = 'ddl'
log_temp_files = 0
log_autovacuum_min_duration = 1000

Of course, you can adjust these to taste. Set log_directory and log_filename as appropriate for your own directory structure and retention policy. (As listed, the log files will build up forever, so some sort of compress-and-delete strategy is required here.) You can set log_min_duration_statement as appropriate to control log volume; if you can get away with log_min_duration_statement = 0, great.

The others should be set as listed on any system; they don’t take up that much in terms of log volume, and the information provided is very helpful.

1 comment