18 January 2023
There’s a particular anti-pattern in database design that PostgreSQL handles… not very well.
For example, let’s say you are building something like Twitch. (The real Twitch doesn’t work this way! At least, not as far as I know!) So, you have streams, and you have users, and users watch streams. So, let’s do a schema!
CREATE TABLE stream (stream_id bigint PRIMARY KEY);
CREATE TABLE "user" (user_id bigint PRIMARY KEY);
CREATE TABLE stream_viewer (
    stream_id bigint REFERENCES stream(stream_id),
    user_id bigint REFERENCES "user"(user_id),
    PRIMARY KEY (stream_id, user_id)
);
OK, schema complete! Not bad for a day’s work. (Note the double quotes around "user": USER is a keyword in PostgreSQL, so we have to put it in double quotes to use it as a table name. This is not great practice, but more about double quotes some other time.)
Let’s say we persuade a very popular streamer over to our platform. They go on-line, and all 1,252,136 of our users simultaneously log on and start following that stream.
So, we now have to insert 1,252,136 new records into stream_viewer. That’s pretty bad. But what’s worse is that we now have 1,252,136 records with a foreign key relationship to a single record in stream. During the INSERT, the transaction doing the INSERT takes a FOR KEY SHARE lock on that referenced record. This means that at any one moment, several thousand different transactions will hold a FOR KEY SHARE lock on that one record.
This is very bad.
If more than one transaction at a time has a lock on a single record, the MultiXact system handles it. MultiXact puts a special transaction ID in the locked record, and then builds an external data structure that holds all of the transaction IDs that have locked the record. This works great… up to a certain size. The in-memory buffers for that structure are of fixed size, and when they fill up, lookups spill onto secondary storage.
As you might imagine, that’s slow. You can see this happening when lots of sessions suddenly start waiting on the various MultiXact* lightweight locks.
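One way to spot it (a sketch; the exact wait event names vary between PostgreSQL versions, but the MultiXact-related ones share the prefix) is to look at pg_stat_activity:

-- Sessions currently waiting on MultiXact-related wait events.
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event LIKE 'MultiXact%';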
You can get around this in a few ways:
- Don’t have that foreign key. Of course, you also then lose referential integrity, so if the stream record is deleted, there may still be lots of stream_viewer records that now have an invalid foreign key.
- Batch up the join operations, as in the sketch below. That way, one big transaction does the INSERTs instead of a large number of small ones. This can make a big difference in both locking behavior and general system throughput. (For extra credit, use a COPY instead of an INSERT to process the batch.)
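Here’s a minimal sketch of the batching approach. The pending_stream_viewer staging table is hypothetical: the application inserts one row per follow request (no foreign keys, so no lock on the stream record), and a periodic job drains it in a single transaction:

-- Staging table, written by the application as follow requests arrive.
CREATE TABLE pending_stream_viewer (
    stream_id bigint NOT NULL,
    user_id bigint NOT NULL
);

-- Periodic batch job: one transaction drains the staging table, so only
-- one FOR KEY SHARE lock is taken on the hot stream record at a time.
WITH batch AS (
    DELETE FROM pending_stream_viewer
    RETURNING stream_id, user_id
)
INSERT INTO stream_viewer (stream_id, user_id)
SELECT DISTINCT stream_id, user_id FROM batch
ON CONFLICT DO NOTHING;  -- tolerate repeated follow requests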
Not many systems have this particular design issue. (You would never actually build a streaming site using that schema, just to start.) But if you do, it’s a behavior you want to avoid.
16 January 2023
Previously, I wrote that you should never lock tables. And you usually shouldn’t! But sometimes, there’s a good reason to. Here’s one.
When you are doing a schema-modifying operation, like adding a column to a table, PostgreSQL needs to take an ACCESS EXCLUSIVE lock on the table while it is modifying the system catalogs. Unless it needs to rewrite the table, this lock isn’t held for very long.
However, locks in PostgreSQL are first-come, first-served. If the system is busy, there may be conflicting locks on the table that you are attempting to modify. (Even a plain SELECT statement takes a lock on the tables it is operating on; it just doesn’t conflict with much.) If the ALTER TABLE statement can’t get its lock right away, it enters a queue, waiting to get to the front and take the lock.
However, now, every lock request after that enters the queue, too, behind the ALTER TABLE. This creates the effect of a long-running ACCESS EXCLUSIVE lock, even though that lock hasn’t even been granted yet. On a busy table on a busy system, this can shut things down.
So, what to do?
You can do this:
DO $$
BEGIN
    FOR i IN 1 .. 1000 LOOP
        BEGIN
            LOCK TABLE t NOWAIT;  -- ACCESS EXCLUSIVE is the default mode
            ALTER TABLE t ADD COLUMN new_column bigint;
            RETURN;  -- got the lock and did the change; we're done
        EXCEPTION WHEN lock_not_available THEN
            PERFORM pg_sleep(0.1);  -- didn't get the lock; sleep, then retry
        END;
    END LOOP;
    RAISE EXCEPTION 'could not acquire lock on t';
END;
$$;
This loops until it can acquire the lock, but doesn’t sit in the queue if it can’t. Once it acquires the lock, it does the modification and exits. If it can’t acquire the lock after a certain number of cycles, it exits with an error (you can set the number of cycles to anything, and you can adjust the time it sleeps after failing to get the lock).
15 January 2023
NUMERIC is a variable-length, exact-precision type. You can have as many digits as you want (and are willing to pay the storage for).
DOUBLE PRECISION is a fixed-size, inexact floating-point type.
Sometimes, the question comes up: how much slower is NUMERIC than DOUBLE PRECISION, anyway?
Here’s a quick, highly unscientific benchmark:
Doing a simple test (100 million rows), a straight SUM() across a NUMERIC column was about 2.2x slower than DOUBLE PRECISION. That went up to about 4x slower with a simple calculation, SUM(n*12), and to about 5x slower when the calculation involved the same type, SUM(n*n). Of course, these are just on my laptop, but I would expect the ratios to remain roughly constant on other machines.
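The test was roughly of this shape (a sketch, not the exact script; the table names are made up, and the timings came from psql’s \timing):

CREATE TABLE t_numeric (n numeric);
CREATE TABLE t_double (n double precision);

-- 100 million random values in each table.
INSERT INTO t_numeric SELECT random() * 1000 FROM generate_series(1, 100000000);
INSERT INTO t_double SELECT random() * 1000 FROM generate_series(1, 100000000);

-- Run each aggregate against both tables and compare.
SELECT sum(n) FROM t_numeric;       -- ~2.2x slower than the t_double version
SELECT sum(n * 12) FROM t_numeric;  -- ~4x slower
SELECT sum(n * n) FROM t_numeric;   -- ~5x slower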
Inserting the 100 million rows took 72.2 seconds for DOUBLE PRECISION, 146.2 seconds for NUMERIC. The resulting table size was 3.5GB for DOUBLE PRECISION, 4.2GB for NUMERIC.
So: NUMERIC is slower. But it’s not absurdly slower.
NUMERIC is much slower than bigint (exercise left to the reader), so using NUMERIC for things like primary keys is definitely not a good idea.
15 November 2019
If you have something interesting to say about PostgreSQL, we would love to get a proposal from you. Even if you have never spoken before, consider responding to the CfP! PgDay 2020 is particularly friendly to first-time and inexperienced speakers. You’re among friends! If you use PostgreSQL, you almost certainly have opinions and experiences that others would love to hear about… go for it!
13 November 2019
The very first PgDay San Francisco is coming to the Swedish-American Hall on January 21, 2020. It’s going to be an amazing event.
If you have something to say about PostgreSQL…
… the Call for Proposals is now open through November 22, 2019. We are looking for 40-minute talks about anything related to PostgreSQL. First-time speakers are particularly encouraged to send in proposals.
If you are interested in or use PostgreSQL…
… Early-Bird Tickets are now available! Attendance is limited, so be sure to get your seat now.
If your company uses PostgreSQL…
… consider sponsoring the event! We can’t do it without our sponsors, and it is a great way to recruit PostgreSQL people. Show off your company to the PostgreSQL community!
20 March 2019
The slides from my presentation at PGDay Nordic 2019 are now available.
11 February 2019
A feature of PostgreSQL that most people don’t even know exists is the ability to export and import transaction snapshots.
The documentation is accurate, but it doesn’t really describe why one might want to do such a thing.
First, what is a “snapshot”? You can think of a snapshot as the current set of committed tuples in the database: a consistent view of the database. When you start a transaction and set it to REPEATABLE READ mode, the snapshot remains consistent throughout the transaction, even if other sessions commit transactions. (In the default transaction mode, READ COMMITTED, each statement starts a new snapshot, so newly committed work could appear between statements within the transaction.)
Each snapshot, however, is local to a single transaction. But suppose you wanted to write a tool that connects to the database in multiple sessions, and does analysis or extraction? Since each session has its own transaction, and the transactions start asynchronously from each other, they could have different views of the database depending on what other transactions got committed. This might generate inconsistent or invalid results.
This isn’t theoretical: suppose you are writing a tool like pg_dump, with a parallel dump facility. If different sessions got different views of the database, the resulting dump would be inconsistent, which would make it useless as a backup tool!
The good news is that we have the ability to “synchronize” various sessions so that they all use the same base snapshot.
First, a transaction opens and sets itself to REPEATABLE READ or SERIALIZABLE mode (there’s no point in exporting a snapshot in READ COMMITTED mode, since the snapshot would be replaced at the very next statement). Then, that session calls pg_export_snapshot(). This returns an identifier for the current transaction snapshot.
Then, the client running the first session passes that identifier to the clients that will be using it. You’ll need to do this via some non-database channel. For example, you can’t use NOTIFY, since the message isn’t actually sent until the transaction that sends it commits.
Each client that receives the snapshot ID can then do SET TRANSACTION SNAPSHOT ... to use the snapshot. The client needs to call this before it does any other work in the transaction (even a SELECT). Now, each of the clients has the same view into the database, and that view will remain until its transaction commits or rolls back.
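Put together, it looks something like this (a sketch; the snapshot ID shown is just an example of the format pg_export_snapshot() returns, and my_table is a placeholder):

-- Session 1: export the snapshot, and keep this transaction open
-- for as long as the other sessions need it.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();  -- returns something like '00000003-0000001B-1'

-- Sessions 2..n: adopt the snapshot before doing anything else.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
SELECT count(*) FROM my_table;  -- every session sees exactly the same data
COMMIT;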
Note that each transaction is still fully autonomous; the various sessions are not “inside” the same transaction. They can’t see each other’s work, and if two different clients modify the database, those modifications are not visible to any other session, including the ones that are sharing the snapshot. You can think of the snapshot as the “base” view of the database, but each session can modify it (subject, of course, to the usual rules involved in modifying the same tuples, or getting serialization failures).
This is a pretty specialized use-case, of course; not many applications need to have multiple sessions with a consistent view of the database. But if you do, PostgreSQL has the facilities to do it!
8 February 2019
PostgreSQL has two autovacuum-age-related settings, vacuum_freeze_table_age and autovacuum_freeze_max_age.
Both of them are in terms of the transaction “age” of a table: that is, how long it has been since the table was last scanned completely for “old” tuples that can be marked as “frozen” (a “frozen” tuple is one that no open transaction can cause to disappear by a rollback). In short, the “oldest” a table can become in PostgreSQL is 2^31-1 transactions; if a table were ever to reach that, data loss would occur. PostgreSQL takes great pains to prevent you from reaching that point.
The “vacuum freeze” process is the process that scans the table and marks these tuples as frozen.
vacuum_freeze_table_age causes a regular autovacuum run to become an “autovacuum (to prevent xid wraparound)” run, that is, an (auto)vacuum freeze, if the age of the table is higher than that setting.
autovacuum_freeze_max_age will cause PostgreSQL to start an “autovacuum (to prevent xid wraparound)” run even if it has no other reason to vacuum the table, should a table age exceed that setting.
The default values are vacuum_freeze_table_age = 100000000 (one hundred million), and autovacuum_freeze_max_age = 200000000 (two hundred million).
Do not change them.
In the past, I made a recommendation I now deeply regret. Because, before 9.6, each autovacuum freeze run scanned the entire table, and (on its first pass) potentially rewrote the entire table, it could be very high I/O, and when it woke up suddenly, it could cause performance issues. I thus recommended two things:
- Increase autovacuum_freeze_max_age, so that the big, mandatory freeze runs happen less often.
- Do manual VACUUM FREEZE operations on the “oldest” tables during low-traffic periods.
Unfortunately, far too many installations adopted recommendation #1, but didn’t do #2. The result was that they cranked up autovacuum_freeze_max_age so high that, by the time the mandatory autovacuum freeze operation began, they were so close to the transaction XID wraparound point that they had no choice but to take the system offline and do the operation in single-user mode.
Thus, I am forever rescinding that advice. Especially now that the freeze scan is incremental as of 9.6, the I/O penalty of an autovacuum freeze is greatly reduced, and the consequences of not doing it are severe.
Don’t increase those parameters. Let autovacuum freeze do its job. If you want to stay ahead of it, we have a script to do opportunistic freezing that might be helpful.
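If you do want to keep an eye on table age, a query along these lines (run in each database) shows the tables closest to the limit:

-- age() is the number of transactions since the table was last fully frozen.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;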
4 February 2019
The slides for my talk, “Breaking PostgreSQL at Scale” at FOSDEM 2019 are available.
26 September 2018
Upon occasion, you want to get the current value of all the sequences in the database. For example, you may have fields that are integer rather than bigint, and you’re concerned about how close you are to overflowing one of them (since sequences are bigint, and will happily crash through the maximum size of a 32-bit integer). However, currval() requires that you have actually accessed the sequence value using nextval() in the current session… and you don’t want to increment the value, just test it!
Here’s a cheap and cheerful PL/pgSQL function that returns all of the current sequence values:
CREATE OR REPLACE FUNCTION sequence_values() RETURNS TABLE(name text, value bigint) AS $sequence_values$
DECLARE
    nsp_name text;
    seq_name text;
BEGIN
    FOR nsp_name, seq_name IN
        SELECT nspname::text, relname::text
        FROM pg_class JOIN pg_namespace
            ON pg_class.relnamespace = pg_namespace.oid
        WHERE relkind = 'S'
    LOOP
        RETURN QUERY EXECUTE 'SELECT ''' || nsp_name || '.' || seq_name || '''::text, last_value FROM "' || nsp_name || '"."' || seq_name || '"';
    END LOOP;
END;
$sequence_values$ LANGUAGE plpgsql;
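It’s an ordinary set-returning function, so, for example, to see which sequences are furthest along:

SELECT name, value
FROM sequence_values()
ORDER BY value DESC
LIMIT 10;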