11 February 2019
15:44
A feature of PostgreSQL that most people don’t even know exists is the ability to export and import transaction snapshots.
The documentation is accurate, but it doesn’t really describe why one might want to do such a thing.
First, what is a “snapshot”? You can think of a snapshot as the current set of committed tuples in the database: a consistent view of the database. When you start a transaction and set it to REPEATABLE READ mode, the snapshot remains consistent throughout the transaction, even if other sessions commit transactions. (In the default transaction mode, READ COMMITTED, each statement starts a new snapshot, so newly committed work could appear between statements within the transaction.)
However, each snapshot is local to a single transaction. But suppose you wanted to write a tool that connected to the database in multiple sessions, and did analysis or extraction? Since each session has its own transaction, and the transactions start asynchronously from each other, they could have different views of the database depending on what other transactions got committed. This might generate inconsistent or invalid results.
This isn’t theoretical: Suppose you are writing a tool like pg_dump, with a parallel dump facility. If different sessions got different views of the database, the resulting dump would be inconsistent, which would make it useless as a backup tool!
The good news is that we have the ability to “synchronize” various sessions so that they all use the same base snapshot.
First, a transaction opens and sets itself to REPEATABLE READ or SERIALIZABLE mode (there’s no point in doing exported snapshots in READ COMMITTED mode, since the snapshot would be replaced at the very next statement). Then, that session calls pg_export_snapshot(). This returns an identifier for the current transaction snapshot.
Then, the client running the first session passes that identifier to the clients that will be using it. You’ll need to do this via some non-database channel. For example, you can’t use LISTEN / NOTIFY, since the message isn’t actually sent until COMMIT time.
Each client that receives the snapshot ID can then do SET TRANSACTION SNAPSHOT ... to use the snapshot. The client needs to call this before it does any other work in the transaction (even a SELECT). Now, each of the clients has the same view into the database, and that view will remain until it COMMITs or ABORTs.
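As a minimal sketch (the snapshot ID shown is illustrative; use whatever string pg_export_snapshot() actually returns):

-- Session 1: export the snapshot, and keep this transaction open
-- until the other sessions have imported it.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- e.g., '00000003-0000001B-1'

-- Session 2: import it, before doing anything else in the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';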
Note that each transaction is still fully autonomous; the various sessions are not “inside” the same transaction. They can’t see each other’s work, and if two different clients modify the database, those modifications are not visible to any other session, including the ones that are sharing the snapshot. You can think of the snapshot as the “base” view of the database, but each session can modify it (subject, of course, to the usual rules involved in modifying the same tuples, or getting serialization failures).
This is a pretty specialized use-case, of course; not many applications need to have multiple sessions with a consistent view of the database. But if you do, PostgreSQL has the facilities to do it!
8 February 2019
18:52
PostgreSQL has two autovacuum-age-related settings: autovacuum_freeze_max_age and vacuum_freeze_table_age.
Both of them are in terms of the transaction “age” of a table: That is, how long it has been since the table has been scanned completely for “old” tuples that can be marked as “frozen” (a “frozen” tuple is one that no open transaction can cause to disappear by a rollback). In short, the “oldest” a table can become in PostgreSQL is 2^31-1 transactions; if a table were ever to reach that, data loss would occur. PostgreSQL takes great pains to prevent you from reaching that point.
The “vacuum freeze” process is the one that scans the table and marks these tuples as frozen.
vacuum_freeze_table_age causes a regular autovacuum run to become an “autovacuum (to prevent xid wraparound)” run, that is, an (auto)vacuum freeze, if the age of the table is higher than vacuum_freeze_table_age.
autovacuum_freeze_max_age will cause PostgreSQL to start an “autovacuum (to prevent xid wraparound)” run even if it has no other reason to vacuum the table, should a table’s age exceed that setting.
By default, vacuum_freeze_table_age = 150000000 (one hundred fifty million), and autovacuum_freeze_max_age = 200000000 (two hundred million).
Do not change them.
In the past, I made a recommendation that I now deeply regret. Because, before 9.6, each autovacuum freeze run scanned the entire table, and (on its first pass) potentially rewrote the entire table, it could generate very high I/O, and when it woke up suddenly, it could cause performance issues. I thus recommended two things:
- Increase autovacuum_freeze_max_age and vacuum_freeze_table_age, and,
- Do manual VACUUM FREEZE operations on the “oldest” tables during low-traffic periods.
Unfortunately, far too many installations adopted recommendation #1, but didn’t do #2. The result was that they cranked up autovacuum_freeze_max_age so high that, by the time the mandatory autovacuum freeze operation began, they were so close to the transaction XID wraparound point that they had no choice but to take the system offline and do the operation in single-user mode.
Thus, I am forever rescinding that advice. Especially now that, as of 9.6, vacuum freeze is incremental, the I/O penalty of an autovacuum freeze is greatly reduced, and the consequences of not doing it are severe.
Don’t increase those parameters. Let autovacuum freeze do its job. If you want to stay ahead of it, we have a script to do opportunistic freezing that might be helpful.
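As a starting point, an ordinary catalog query (not that script; just an illustration) lists the tables with the highest transaction age:

-- Compare these ages against autovacuum_freeze_max_age (200 million by default).
SELECT relname, age(relfrozenxid) AS xid_age
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY xid_age DESC
 LIMIT 10;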
4 February 2019
14:00
The slides for my talk, “Breaking PostgreSQL at Scale” at FOSDEM 2019 are available.
26 September 2018
11:06
Upon occasion, you want to get the current value of all the sequences in the database. For example, you may have fields that are integer rather than bigint, and you’re concerned about how close you are to overflowing one of them (since sequences are bigint, and will happily crash through the maximum value of a 32-bit integer).
Unfortunately, currval() requires that you have actually accessed the sequence value using nextval() in the current session… but you don’t want to increment the value, just test it!
Here’s a cheap and cheerful PL/pgSQL function that returns all of the current sequence values:
CREATE OR REPLACE FUNCTION sequence_values() RETURNS TABLE(name text, value bigint) AS $sequence_values$
DECLARE
    nsp_name TEXT;
    seq_name TEXT;
BEGIN
    -- Iterate over every sequence in the database (relkind 'S').
    FOR nsp_name, seq_name IN
        SELECT nspname::text, relname::text
          FROM pg_class
          JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
         WHERE relkind = 'S'
    LOOP
        -- format() safely quotes the identifiers (%I) and the label (%L),
        -- even for names that need escaping.
        RETURN QUERY EXECUTE format(
            'SELECT %L::text, last_value FROM %I.%I',
            nsp_name || '.' || seq_name, nsp_name, seq_name);
    END LOOP;
END;
$sequence_values$
LANGUAGE plpgsql;
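Once the function is in place, getting every sequence’s current value is a single query:

SELECT * FROM sequence_values() ORDER BY value DESC;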
12 September 2018
20:22
The slides from my presentation, Securing PostgreSQL at PDXPUG PostgreSQL Day 2018 are now available.
4 September 2018
12:00
In addition to the familiar text types VARCHAR and TEXT, PostgreSQL has a type CHAR. It’s little used… and that’s for a reason. It has some very unusual behaviors, which can be quite a surprise if you are not expecting them.
First, CHAR is a fixed-width type. When character data is stored in it, it’s padded out with spaces if it is not full length:
xof=# create table chars (c char(20));
CREATE TABLE
xof=# insert into chars values('x');
INSERT 0 1
xof=# select * from chars;
          c
----------------------
 x
(1 row)
OK, that’s reasonable, right? But what is going on here?
xof=# select length(c) from chars;
 length
--------
      1
(1 row)

xof=# select substring(c from 8 for 1) = ' '::char(1) from chars;
 ?column?
----------
 t
(1 row)

xof=# select substring(c from 8 for 1) = ' '::varchar(1) from chars;
 ?column?
----------
 f
(1 row)

xof=# select length(substring(c from 8 for 1)) from chars;
 length
--------
      0
(1 row)

xof=# select c || 'y' from chars;
 ?column?
----------
 xy
(1 row)
CHAR, when actually used, first trims off all trailing spaces, then applies the operation. It is trying to simulate a variable-length type, for historic reasons. This can be quite surprising, since a supposedly fixed-length type suddenly starts behaving as if it were variable. Unless you are terribly nostalgic for punched cards, CHAR is generally not what you want.
Is there ever a time to use CHAR? Not really. If you have a single-character enumeration that can never be either '' (the empty string) or ' ' (a single space), it might be more logical to store it as CHAR(1) rather than VARCHAR, but any space savings will be minimal and highly dependent on the alignment of the surrounding items.
And for n > 1, just use VARCHAR… or TEXT. (Remember that in PostgreSQL, VARCHAR and TEXT are stored the same way.)
28 August 2018
17:00
It’s not uncommon that an application needs to serialize access to one or more resources. The temptation is very high to use the LOCK TABLE SQL statement to do this.
Resist the temptation.
There are many issues with using LOCK:
- It blocks autovacuum, which can cause bloat or even transaction ID wraparound in extreme cases.
- An ACCESS EXCLUSIVE lock (the default mode) is passed down to secondaries, which can block queries there, or even cause deadlock-type situations.
- It’s easy to cause deadlocks with bad LOCKing order.
If the goal is to serialize access, consider using advisory locks instead. They have all of the benefits of a LOCK on a table, without blocking autovacuum or access on secondaries.
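As a minimal sketch, using a session-level advisory lock (the key 42 is arbitrary; any bigint your application agrees on will do):

-- Session 1: take the lock before touching the shared resource.
SELECT pg_advisory_lock(42);
-- ... do the serialized work ...
SELECT pg_advisory_unlock(42);

-- Session 2: the non-blocking variant returns false instead of waiting.
SELECT pg_try_advisory_lock(42);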
(Yes, some database tools may need to take explicit locks for a variety of reasons; that’s a different matter, of course.)
9 August 2018
10:06
pg_rewind is a utility included with PostgreSQL since 9.5. It’s used to “rewind” a server so that it can be attached as a secondary to a primary. The server being rewound could be the former primary of the new primary, or a secondary that was a peer of the new primary.
In pg_rewind terminology, and in this post, the “source” server is the new primary that the old server is going to be attached to, and the “target” is the server that will be attached to the source as a secondary.
Step One: Have a WAL Archive
While pg_rewind does not require that you have a WAL archive, you should have one. pg_rewind works by “backing up” the target server to a state before the last shared checkpoint of the two servers. Then, when the target starts up, it uses WAL information to replay itself to the appropriate point at which it can connect as a streaming replica to the source. To do that, it needs the WAL information from the rewind point onwards. Since the source had no reason to “know” that it would be used as a primary, it may not have enough WAL information in its pg_xlog / pg_wal directory to bring the target up to date. If it doesn’t, you are back to rebuilding the new secondary, the exact situation that pg_rewind is meant to avoid.
Thus, make sure you have a WAL archive that the target can consult as it is coming up.
Step Two: Properly Promote the Source Server
The source server, which will be the new primary, needs to be properly promoted. Use the pg_ctl promote option, or the trigger_file option in recovery.conf, so that the source promotes itself and starts a new timeline. Don’t just shut the source down, remove recovery.conf, and bring it back up! That doesn’t create a new timeline, and the source won’t have the appropriate divergence point from the target for pg_rewind to consult.
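For example (the data directory path is hypothetical):

pg_ctl promote -D /var/lib/postgresql/data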
Step Three: Wait for the Forced Checkpoint to Complete
When a secondary is promoted to being a primary, it starts a forced checkpoint when it exits recovery mode. This checkpoint is a “fast” checkpoint, but it can still take a while, depending on how big shared_buffers is and how many buffers are dirty. Use tail -f to monitor the logs on the source, and wait for that forced checkpoint to complete before running pg_rewind to rewind the target. Failing to do this can cause the target to be corrupted. If you are writing a script to do this, issue a CHECKPOINT statement to the source before running pg_rewind.
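A sketch of that scripted sequence (the hostname, user, and data directory are hypothetical, and the target must already be cleanly shut down):

psql -h new-primary -U postgres -c "CHECKPOINT;"
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server="host=new-primary user=postgres dbname=postgres"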
And have fun rewinding servers!
7 August 2018
17:58
PostgreSQL has a variety of ways of telling time: now(), statement_timestamp(), and clock_timestamp(). Each has a different sense of when “now” is:
now() is the time at the start of the transaction; it never changes while the current transaction is open.
statement_timestamp() is the time that the current statement started running. It changes from statement to statement, but is constant within a statement (which means it is constant within a PL/pgSQL function).
clock_timestamp() changes each time it is called, regardless of context.
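A quick way to see all three behaviors in one transaction, using pg_sleep() to make the clock move:

BEGIN;
SELECT now(), statement_timestamp(), clock_timestamp();
SELECT pg_sleep(2);
SELECT now(), statement_timestamp(), clock_timestamp();
-- In the second result, now() is unchanged, while statement_timestamp()
-- and clock_timestamp() have each advanced by about two seconds.
COMMIT;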
Each has its uses:
- For a predicate in a WHERE clause, you want either now() or statement_timestamp(). These work properly with indexes, because they are constant within the execution of a statement.
- If you need the time to update within a single transaction, use statement_timestamp(); otherwise, use now().
- Generally, you only use clock_timestamp() inside of a programming language procedure, so you can get the current timestamp as the code runs.
23 July 2018
09:00
(This is another entry in an intermittent series of small things that are easy to forget, and that cause irritation in the PostgreSQL world…)
Since version 1.7, pgbouncer has had built-in TLS support. This is very handy, since previous versions required stunnel or something similar to provide TLS.
Frequently, either for test purposes or because you’re in an environment where checking client certificates isn’t required, pgbouncer is set up with a self-signed certificate. It’s easy to forget that you need to set the certificate authority parameter (client_tls_ca_file) to point to the certificate file in this case, but you do:
client_tls_sslmode = allow
client_tls_key_file = /etc/pgbouncer/pgbouncer.key
client_tls_cert_file = /etc/pgbouncer/pgbouncer.pem
client_tls_ca_file = /etc/pgbouncer/pgbouncer.pem ; don't forget this one!
Note that it’s generally not a great idea to use a self-signed certificate in production, since you are vulnerable to man-in-the-middle attacks in that case.