18 January 2023
00:00
There’s a particular anti-pattern in database design that PostgreSQL handles… not very well.
For example, let’s say you are building something like Twitch. (The real Twitch doesn’t work this way! At least, not as far as I know!) So, you have streams, and you have users, and users watch streams. So, let’s do a schema!
CREATE TABLE stream (stream_id bigint PRIMARY KEY);
CREATE TABLE "user" (user_id bigint PRIMARY KEY);
CREATE TABLE stream_viewer (
    stream_id bigint REFERENCES stream(stream_id),
    user_id bigint REFERENCES "user"(user_id),
    PRIMARY KEY (stream_id, user_id)
);
OK, schema complete! Not bad for a day’s work. (Note the double quotes around "user". USER is a keyword in PostgreSQL, so we have to put it in double quotes to use it as a table name. This is not great practice, but more about double quotes some other time.)
Let’s say we persuade a very popular streamer over to our platform. They go on-line, and all 1,252,136 of our users simultaneously log on and start following that stream.
So, we now have to insert 1,252,136 new records into stream_viewer. That’s pretty bad. But what’s worse is that we now have 1,252,136 records with a foreign key relationship to a single record in stream. During the operation of the INSERT statement, the transaction that is doing the INSERT will take a FOR KEY SHARE lock on that record. This means that at any one moment, several thousand different transactions will have a FOR KEY SHARE lock on that record.
This is very bad.
If more than one transaction at a time has a lock on a single record, the MultiXact system handles this. MultiXact puts a special transaction ID in the record that’s locked, and then builds an external data structure that holds all of the transaction IDs that have locked the record. This works great… up to a point. That data structure has a fixed-size in-memory portion, and when it fills up, it spills onto secondary storage.
As you might imagine, that’s slow. You can see this with lots of sessions suddenly waiting on various MultiXact* lightweight locks.
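One quick way to spot this (a sketch; the wait_event columns in pg_stat_activity exist in 9.6 and later, and the exact MultiXact wait event names vary by version) is to look for sessions waiting on MultiXact-related lightweight locks:

-- Sessions currently waiting on MultiXact-related lightweight locks.
SELECT pid, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock'
   AND wait_event LIKE 'MultiXact%';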
You can get around this in a few ways:
- Don’t have that foreign key. Of course, you also then lose referential integrity, so if the stream record is deleted, there may still be lots of stream_viewer records that now have an invalid foreign key.
- Batch up the join operations. That way, one big transaction is doing the INSERTs instead of a large number of small ones. This can make a big difference in both locking behavior and general system throughput. (For extra credit, use a COPY instead of an INSERT to process the batch; there’s a sketch of this after the list.)
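A rough version of the batched approach might look like this (pending_viewer is a hypothetical staging table holding the users to add for a given stream; the other names are from the schema above):

BEGIN;
-- One statement inserts the whole batch, so a single transaction holds the
-- FOR KEY SHARE lock on the stream row instead of thousands of small ones.
INSERT INTO stream_viewer (stream_id, user_id)
    SELECT stream_id, user_id
      FROM pending_viewer
    ON CONFLICT DO NOTHING;
COMMIT;

For the COPY variant, you would COPY the batch into the staging table first, then run a single INSERT … SELECT like the one above.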
Not many systems have this particular design issue. (You would never actually build a streaming site using that schema, just to start.) But if yours does, this is a behavior you want to avoid.
16 January 2023
00:00
Previously, I wrote that you should never lock tables. And you usually shouldn’t! But sometimes, there’s a good reason to. Here’s one.
When you are doing a schema-modifying operation, like adding a column to a table, PostgreSQL needs to take an ACCESS EXCLUSIVE lock on the table while it is modifying the system catalogs. Unless it needs to rewrite the table, this lock isn’t held for very long.
However, locks in PostgreSQL are first-come, first-served. If the system is busy, there may be conflicting locks on the table that you are attempting to modify. (Even just a SELECT statement takes a lock on the tables it is operating on; it just doesn’t conflict with much.) If the ALTER TABLE statement can’t get the lock right away, it enters a queue, waiting to get to the front and get the lock.
However, now every lock request after that enters the queue, too, behind that ALTER TABLE. This can create the effect of a long-running ACCESS EXCLUSIVE lock, even though the lock is never actually granted. On a busy table on a busy system, this can shut things down.
So, what to do?
You can do this:
DO $$
BEGIN
    FOR i IN 1 .. 1000 LOOP
        BEGIN
            -- Try to get the lock, but don't wait in the queue if we can't.
            LOCK TABLE t NOWAIT;
            ALTER TABLE t ADD COLUMN i BIGINT;
            RETURN;
        EXCEPTION WHEN lock_not_available THEN
            -- Someone else holds a conflicting lock; sleep and try again.
            PERFORM pg_sleep(1);
            CONTINUE;
        END;
    END LOOP;
    -- Give up after 1000 attempts.
    RAISE lock_not_available;
END;
$$;
This loops until it can acquire the lock, but doesn’t sit in the queue if it can’t. Once it acquires the lock, it does the modification and exits. If it can’t acquire the lock after a certain number of cycles, it exits with an error. (You can set the number of cycles to anything, and you can adjust the time it sleeps after failing to get the lock.)
15 January 2023
00:00
In PostgreSQL, NUMERIC is a variable-length, exact numeric type: you can have as many digits as you want (and are willing to pay the storage for). DOUBLE PRECISION is a floating-point type, with variable precision.
Sometimes, the question comes up: how much slower is NUMERIC than DOUBLE PRECISION, anyway?
Here’s a quick, highly unscientific benchmark:
Doing a simple test (100 million rows), a straight SUM() across a NUMERIC column was about 2.2x slower than DOUBLE PRECISION. It went up to about 4x slower if there was a simple calculation, SUM(n*12). It was about 5x slower if the calculation involved the same type, SUM(n*n). Of course, these numbers are just from my laptop, but I would expect the ratios to remain roughly constant on other machines.
Inserting the 100 million rows took 72.2 seconds for DOUBLE PRECISION and 146.2 seconds for NUMERIC. The resulting table size was 3.5GB for DOUBLE PRECISION and 4.2GB for NUMERIC.
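If you want to try this kind of test yourself, a rough sketch (bench_d and bench_n are made-up names; this is not the exact script behind the numbers above) looks something like this, with \timing on in psql:

CREATE TABLE bench_d (n double precision);
CREATE TABLE bench_n (n numeric);

-- Load 100 million random values into each table.
INSERT INTO bench_d SELECT random() FROM generate_series(1, 100000000);
INSERT INTO bench_n SELECT random()::numeric FROM generate_series(1, 100000000);

-- Compare the timings of each pair.
SELECT SUM(n) FROM bench_d;
SELECT SUM(n) FROM bench_n;
SELECT SUM(n*n) FROM bench_d;
SELECT SUM(n*n) FROM bench_n;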
So, yes, NUMERIC is slower. But it’s not absurdly slower. NUMERIC is much slower than bigint (exercise left to the reader), so using NUMERIC for things like primary keys is definitely not a good idea.
15 November 2019
10:00
If you have something interesting to say about PostgreSQL, we would love to get a proposal from you. Even if you have never spoken before, consider responding to the CfP! PgDay 2020 is particularly friendly to first-time and inexperienced speakers. You’re among friends! If you use PostgreSQL, you almost certainly have opinions and experiences that others would love to hear about… go for it!
13 November 2019
13:26
The very first PgDay San Francisco is coming to the Swedish-American Hall on January 21, 2020. It’s going to be an amazing event.
If you have something to say about PostgreSQL…
… the Call for Proposals is now open through November 22, 2019. We are looking for 40-minute talks about anything related to PostgreSQL. First-time speakers are particularly encouraged to send in proposals.
If you are interested in or use PostgreSQL…
… Early-Bird Tickets are now available! Attendance is limited, so be sure to get your seat now.
If your company uses PostgreSQL…
… consider sponsoring the event! We can’t do it without our sponsors, and it is a great way to recruit PostgreSQL people. Show off your company to the PostgreSQL community!
20 March 2019
06:34
The slides from my presentation at PGDay Nordic 2019 are now available.
11 February 2019
15:44
A feature of PostgreSQL that most people don’t even know exists is the ability to export and import transaction snapshots.
The documentation is accurate, but it doesn’t really describe why one might want to do such a thing.
First, what is a “snapshot”? You can think of a snapshot as the current set of committed tuples in the database, a consistent view of the database. When you start a transaction and set it to REPEATABLE READ mode, the snapshot remains consistent throughout the transaction, even if other sessions commit transactions. (In the default transaction mode, READ COMMITTED, each statement starts a new snapshot, so newly committed work could appear between statements within the transaction.)
However, each snapshot is local to a single transaction. But suppose you wanted to write a tool that connected to the database in multiple sessions, and did analysis or extraction? Since each session has its own transaction, and the transactions start asynchronously from each other, they could have different views of the database depending on what other transactions got committed. This might generate inconsistent or invalid results.
This isn’t theoretical: suppose you are writing a tool like pg_dump, with a parallel dump facility. If different sessions got different views of the database, the resulting dump would be inconsistent, which would make it useless as a backup tool!
The good news is that we have the ability to “synchronize” various sessions so that they all use the same base snapshot.
First, a transaction opens and sets itself to REPEATABLE READ or SERIALIZABLE mode (there’s no point in doing exported snapshots in READ COMMITTED mode, since the snapshot will get replaced at the very next statement). Then, that session calls pg_export_snapshot. This creates an identifier for the current transaction snapshot.
Then, the client running the first session passes that identifier to the clients that will be using it. You’ll need to do this via some non-database channel. For example, you can’t use LISTEN / NOTIFY, since the message isn’t actually sent until COMMIT time.
Each client that receives the snapshot ID can then do SET TRANSACTION SNAPSHOT ... to use the snapshot. The client needs to call this before it does any work in the session (even SELECT). Now, each of the clients has the same view into the database, and that view will remain until it COMMITs or ABORTs.
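A minimal sketch of the whole flow (the snapshot identifier shown is just an example; use whatever pg_export_snapshot actually returns):

-- Session 1: export the snapshot.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns something like '00000003-0000001B-1'

-- Session 2: adopt it before doing any other work in its transaction.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
-- Both sessions now see the same consistent view of the database.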
Note that each transaction is still fully autonomous; the various sessions are not “inside” the same transaction. They can’t see each other’s work, and if two different clients modify the database, those modifications are not visible to any other session, including the ones that are sharing the snapshot. You can think of the snapshot as the “base” view of the database, but each session can modify it (subject, of course, to the usual rules involved in modifying the same tuples, or getting serialization failures).
This is a pretty specialized use-case, of course; not many applications need to have multiple sessions with a consistent view of the database. But if you do, PostgreSQL has the facilities to do it!
8 February 2019
18:52
PostgreSQL has two autovacuum-age related settings, autovacuum_freeze_max_age and vacuum_freeze_table_age.
Both of them are in terms of the transaction “age” of a table: that is, how long it has been since the table was last scanned completely for “old” tuples that can be marked as “frozen” (a “frozen” tuple is one that no open transaction can cause to disappear by a rollback). In short, the “oldest” a table can become in PostgreSQL is 2^31-1 transactions; if a table were ever to reach that, data loss would occur. PostgreSQL takes great pains to prevent you from reaching that point.
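If you’re curious how old your tables currently are, a query along these lines (just a sketch) will show you:

-- The ten "oldest" tables in the current database, by transaction age.
SELECT relname, age(relfrozenxid) AS xid_age
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY age(relfrozenxid) DESC
 LIMIT 10;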
The “vacuum freeze” process is the process that scans the table and marks these tuples as frozen.
vacuum_freeze_table_age causes a regular autovacuum run to be an “autovacuum (to prevent xid wraparound)” run, that is, an (auto)vacuum freeze, if the age of the table is higher than vacuum_freeze_table_age.
autovacuum_freeze_max_age will cause PostgreSQL to start an “autovacuum (to prevent xid wraparound)” run even if it has no other reason to vacuum the table, should the table’s age exceed that setting.
By default, vacuum_freeze_table_age = 100000000 (one hundred million), and autovacuum_freeze_max_age = 200000000 (two hundred million).
Do not change them.
In the past, I made a recommendation I now deeply regret. Because, before 9.6, each autovacuum freeze run scanned the entire table, and (on its first pass) potentially rewrote the entire table, it could generate very high I/O, and when it kicked in suddenly, it could cause performance problems. I thus recommended two things:
- Increase autovacuum_freeze_max_age and vacuum_freeze_table_age, and,
- Do manual VACUUM FREEZE operations on the “oldest” tables during low-traffic periods.
Unfortunately, far too many installations adopted recommendation #1, but didn’t do #2. The result was that they cranked up autovacuum_freeze_max_age so high that by the time the mandatory autovacuum freeze operation began, they were so close to the transaction XID wraparound point that they had no choice but to take the system offline and do the operation in single-user mode.
Thus, I am forever rescinding that advice. Especially now that vacuum freeze in 9.6 and later is incremental, the I/O penalty of an autovacuum freeze is greatly reduced, and the consequences of not doing it are severe.
Don’t increase those parameters. Let autovacuum freeze do its job. If you want to stay ahead of it, we have a script to do opportunistic freezing that might be helpful.
4 February 2019
14:00
The slides for my talk, “Breaking PostgreSQL at Scale” at FOSDEM 2019 are available.
26 September 2018
11:06
Upon occasion, you want to get the current value of all the sequences in the database. For example, you may have fields that are integer rather than bigint, and you’re concerned about how close you are to overflowing one of them (since sequences are bigint and will happily crash through the size of a 32-bit integer).
Unfortunately, currval() requires that you have actually accessed the sequence value using nextval() in the current session… but you don’t want to increment the value, just test it!
Here’s a cheap and cheerful PL/pgSQL function that returns all of the current sequence values:
CREATE OR REPLACE FUNCTION sequence_values() RETURNS TABLE(name text, value bigint) AS $sequence_values$
DECLARE
    nsp_name TEXT;
    seq_name TEXT;
BEGIN
    -- Find every sequence in the database (relkind 'S' in pg_class).
    FOR nsp_name, seq_name IN
        SELECT nspname::text, relname::text
          FROM pg_class
          JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
         WHERE relkind = 'S'
    LOOP
        -- Read last_value directly from the sequence, without touching nextval().
        RETURN QUERY EXECUTE 'SELECT ''' || nsp_name || '.' || seq_name || '''::text, last_value FROM "' || nsp_name || '"."' || seq_name || '"';
    END LOOP;
END;
$sequence_values$
LANGUAGE plpgsql;
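For example, to see which sequences are getting close to the 32-bit integer limit (the threshold here is just an illustration):

SELECT * FROM sequence_values() WHERE value > 2000000000 ORDER BY value DESC;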