postgresql when it's not your job

18 July 2018

09:00

pg_rewind and checkpoints: caution!

pg_rewind, introduced in PostgreSQL 9.5, is a powerful utility solving a particular problem: If you have promoted a streaming replication secondary into being a primary, how can you make sure that the former primary, and any other secondaries that used to be connected to it, are able to connect to the new primary? Previously, there was no entirely safe way of doing so without simply rebuilding the secondaries using pg_basebackup (for example), which could take a very long time on large databases.

pg_rewind works by connecting to (or having direct file-system-level access to) the new primary, and using the WAL information to “back up” the target systems (the old primary or old peer secondaries) to the point that they can reattach as secondaries to the new primary.

It works wonderfully… but like anything powerful, it has some warnings associated with it. Here’s one of them.

In a recent situation, a client was doing a flip-back-and-forth stress test, in which a secondary would be promoted, its former primary rewound, the two reattached, and then back again. This worked well for many iterations, but after one particularly fast iteration, the new secondary (former primary, now rewound) wouldn’t come back up; the message was:

requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103

What happened?

When PostgreSQL exits recovery, it starts a forced checkpoint; you can see this in the logs with messages like:

checkpoint starting: force
database system is ready to accept connections

Note that the relevant text here is that the checkpoint is starting; it hasn’t completed yet.

The hypothesis is (not 100% confirmed, but seems likely) that the pg_rewind on the new secondary was done before the checkpoint had finished. Since one of the last things a checkpoint does is write the pg_control file, and one of the first things that pg_rewind does is read the control file from the source machine, there’s a window where the pg_control file on disk will be out of date. The result is a rather confused situation on the new secondary.

Thus, if you are going to do a pg_rewind, make sure the initial checkpoint on the source machine has completed before doing the rewind. In a scripted environment, this can be tricky, since the initial checkpoint can take any amount of time depending on how much has to be flushed out to disk. One option is to issue a CHECKPOINT statement to the newly-promoted primary to make sure a checkpoint has been completed; this does a fast checkpoint, and doesn’t return until the checkpoint is completed. (Edited to recommend CHECKPOINT.)
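
In script form, the safe ordering looks something like this (a minimal sketch; the host name, data directory, and connection details are placeholders, and error handling is omitted):

# Force a checkpoint on the newly-promoted primary; CHECKPOINT does not
# return until the checkpoint is complete.
psql -h new-primary -U postgres -c 'CHECKPOINT;'

# Only then rewind the old primary so it can rejoin as a secondary.
pg_rewind --target-pgdata=/pgsql/10/data \
    --source-server='host=new-primary user=postgres dbname=postgres'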

(Thanks to Michael Paquier and Andres Freund for pointing me in the right direction here.)

17 July 2018

10:38

That Google Checksum Tool

Google recently released a tool to check on-disk checksums in PostgreSQL. PostgreSQL being hot, and Google being Google, this generated a lot of press, much of it containing an element of FUD about the integrity of PostgreSQL backups (note that Google’s own announcements were quite straightforward about it).

First, some background: since PostgreSQL 9.3, page checksums have been available as an option when initializing a new database. These are fast, lightweight checksums that verify the data on a page hasn’t been corrupted. It’s an excellent and low-impact data integrity feature, and highly recommended.
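
If you are creating a new cluster, enabling them is a single initdb option, and you can confirm it took effect with pg_controldata (the data directory here is just a placeholder):

# Initialize the cluster with data page checksums enabled.
initdb --data-checksums -D /pgsql/10/data

# Confirm: "Data page checksum version" should be non-zero.
pg_controldata /pgsql/10/data | grep checksum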

However, PostgreSQL only verifies a page’s checksum when it actually reads that page into memory; it doesn’t go looking for corruption in pages it has no reason to read (and why would it?). Thus, there can be lurking corruption in a database that you don’t discover until too late… specifically, after all of the backups also contain it. Some PostgreSQL page-level backup tools, such as pgBackRest, check the checksums during the backup, but other tools (such as WAL-E and WAL-G) do not.

Google’s tool adds to that collection by doing an on-disk scan of each page, and verifying the checksum. That’s all it does. That’s not useless! It’s relevant to backups because doing this scan on a backup image doesn’t create the large I/O burden that doing so on an active server would.

So, if you are using a backup tool that does not verify backups, and that creates a disk image (as opposed to, say, pg_dump), by all means consider this new addition as part of your toolbox. But there’s nothing inherent about PostgreSQL backups that makes them more prone to corruption, and some of the unfortunate press around this has given that impression.

16 July 2018

20:33

One of those things: /tmp/pgpool_status

(This is another intermittent series of small things that are easy to forget, and cause irritation in the PostgreSQL world…)

When setting up pgpool2, it’s common to tweak the configuration file repeatedly… and often get a hostname wrong. A common result is that a backend keeps showing as down even after the configuration has been corrected.

Often, the problem is that the pgpool_status file, usually kept in /tmp, has cached the status of one of the hosts. The pgpool_status file is retained across pgpool2 restarts. Deleting the pgpool_status file will fix the issue.
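
Something like the following clears the cached state (the service name and file location vary by packaging and pgpool2 configuration; treat these as placeholders):

# Stop pgpool2, remove the cached node-status file, then start it again.
systemctl stop pgpool2
rm -f /tmp/pgpool_status
systemctl start pgpool2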

(Why, yes, this just happened to me today…)

10 May 2018

12:00

PostgreSQL Replication at PerconaLive 2018

The slides from my talk on PostgreSQL Replication at PerconaLive 2018 are available.

10 April 2018

18:36

Change these: wal_level and wal_log_hints

(An intermittent series on PostgreSQL parameters whose default settings you should change.)

wal_level = logical

wal_level controls how much information is written into the write-ahead log. Higher settings write more information, enabling more features, at the expense of (somewhat) larger write-ahead log volume.

Just set it to logical, the highest setting. The incremental volume is not that significant, and you are then ready to support logical replication without a restart in case you feel the need.

(If you are running 9.3 or earlier, the highest wal_level setting is hot_standby, but you should have upgraded by now.)

wal_log_hints = on # if you do not have data page checksums enabled.

This setting enables the logging of certain hint-bit changes. Without going into gruesome detail, this is required for certain tools (in particular, pg_rewind) to operate properly. Just turn it on.

If the cluster was initialized with data page checksums, you’re getting hint bit logging anyway, so there’s no need to turn it on (although it doesn’t hurt to do so; it’s ignored on clusters with data page checksums).
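
Both parameters take effect only at server start, so changing them requires a restart. A minimal sketch, assuming superuser access and that you manage restarts yourself (the data directory is a placeholder):

# Record both settings; neither takes effect until the restart.
psql -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"
psql -U postgres -c "ALTER SYSTEM SET wal_log_hints = 'on';"
pg_ctl -D /pgsql/10/data restart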

6 April 2018

08:00

Change these: Logging

(An intermittent series on PostgreSQL parameters whose default settings you should change.)

Here’s my preferred logging configuration:

log_destination = 'csvlog'
logging_collector = on
log_directory = '/var/log/postgresql'
log_filename = 'postgresql-%Y%m%d-%H%M%S.log'
log_rotation_size = 1GB
log_rotation_age = 1d
log_min_duration_statement = 250ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_statement = 'ddl'
log_temp_files = 0
log_autovacuum_min_duration = 1000

Of course, you can adjust these to taste. Set log_directory and log_filename as appropriate for your own directory structure and retention policy. (As listed, the log files will build up forever, so some sort of compress-and-delete strategy, like the sketch below, is required.) You can set log_min_duration_statement as appropriate to control log volume; if you can get away with log_min_duration_statement = 0, great.

The others should be set as listed on any system; they don’t take up that much in terms of log volume, and the information provided is very helpful.
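
For example, a daily cron job along these lines keeps the log directory under control (the paths and retention windows are just placeholders, and -delete assumes GNU find; adjust to your own policy):

# Compress log files that haven't been written to in over a day,
# and delete compressed logs older than 30 days.
find /var/log/postgresql -name 'postgresql-*' ! -name '*.gz' -mtime +1 -exec gzip {} \;
find /var/log/postgresql -name 'postgresql-*.gz' -mtime +30 -delete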

3 April 2018

13:00

Change these: archive_mode and archive_command

(An intermittent series on PostgreSQL parameters whose default settings you should change.)

By default archive_mode is set to off, and thus archive_command is ignored. Even if you are not going to be using WAL archiving, you should change these. You can use settings such as:

archive_mode = on
archive_command = '/bin/true'

It requires a server restart to turn on archive_mode, but just a reload to change archive_command. Thus, you’re ready to start WAL archiving without a restart by getting these set in advance.
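
When you do decide to start archiving, only archive_command needs to change, and a reload picks it up. A hedged example (the rsync destination is purely illustrative, and a production archive_command needs more care than this):

# %p is the path of the WAL file to archive, %f is its file name.
psql -U postgres -c "ALTER SYSTEM SET archive_command = 'rsync -a %p archive-host:/pg_archive/%f';"
psql -U postgres -c "SELECT pg_reload_conf();"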

30 March 2018

21:28

Change This: wal_compression

(An intermittent series on PostgreSQL parameters whose default settings you should change.)

Introduced in PostgreSQL 9.5, wal_compression is off by default, but you should probably turn it on.

First, what does it do? The documentation helpfully explains:

When this parameter is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup.

By default (that is, when full_page_writes = on, which is the default setting), the first time a data page is changed after a checkpoint, the entire data page is written into the write-ahead log. (Tomas Vondra has a good explanation of this and why it is done.)

This results in a significant burst of extra WAL traffic after a checkpoint. (This is one of the reasons to increase the time period between checkpoints.) This setting, when on, compresses those full-page images so that they don’t take up quite so much WAL space, reducing both the disk space required to hold the WAL and the amount of network traffic required to send the WAL information down to a streaming replica.

It’s almost always a benefit to turn it on. The only time it might hurt performance is if the database is heavily CPU-limited, so that the (small) extra CPU required to compress the WAL information is significant. That’s unlikely, so turn it on.
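
Unlike wal_level, wal_compression does not require a restart; something like this is enough (assuming superuser access):

# Takes effect on reload; no restart needed.
psql -U postgres -c "ALTER SYSTEM SET wal_compression = 'on';"
psql -U postgres -c "SELECT pg_reload_conf();"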

27 March 2018

18:12

Mountpoints and the Single PostgreSQL Server

Ultimately, a PostgreSQL database is just files, and those files have to be stored somewhere. Absent tablespaces, they are all stored in a single directory structure, traditionally called PGDATA. While smaller installations can just use the defaults associated with the packaging, larger databases are often installed on their own volume with its own mountpoint.

Two rules to keep in mind when choosing a name for the mountpoint and the directory structure:

  1. Always include the major version number in the directory path, but
  2. Never include the major version in the mountpoint.

For example, /pgsql is a great mountpoint name, but /pgsql/10 isn’t. Instead, call the mountpoint /pgsql and call PGDATA /pgsql/10/data (for example) under it.

Why?

At some point, you’ll probably want to have more than one major version installed on the same server. For example, you may want to use pg_upgrade to upgrade between major versions. In --link mode (which is the most sensible way to use pg_upgrade), you need to create a new cluster for the new major version, but that new cluster needs to be on the same file system as the old cluster, so that pg_upgrade can create hard links between the data files.
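
For example (version numbers, data paths, and binary locations here are purely illustrative), an in-place upgrade from 9.6 to 10 with both clusters under the same /pgsql mountpoint might look roughly like:

# Create the new cluster on the same filesystem as the old one,
# using the new version's initdb.
/usr/lib/postgresql/10/bin/initdb -D /pgsql/10/data

# --link hard-links data files instead of copying them, which only works
# when both data directories are on the same filesystem.
pg_upgrade --link \
    --old-datadir=/pgsql/9.6/data --new-datadir=/pgsql/10/data \
    --old-bindir=/usr/lib/postgresql/9.6/bin --new-bindir=/usr/lib/postgresql/10/bin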

If the major version is baked into the mountpoint, you start getting things like /pgdata/9.6/10 and similar awful situations.

So, make life easy for yourself and make the mountpoint generic, and use a directory path with the major version in it for PGDATA.

20 March 2018

06:28

Why PostgreSQL is Terrible? at Nordic PGDay

The slides from my talk at Nordic PGDay, Why PostgreSQL is Terrible, are now available.
