9 May 2023
10:58
I shouldn’t have to say this, but don’t use ChatGPT for technical advice.
In an experiment, I asked 40 questions about PostgreSQL. 23 came back with misleading or simply inaccurate information. Of those, 9 came back with answers that would have caused (at best) performance issues. One of the answers could result in a corrupted database (deleting WAL files to recover disk space).
LLMs are not a replacement for expertise.
3 May 2023
09:55
Recently on one of the PostgreSQL mailing lists, someone wrote in asking if it was possible to get PostgreSQL to listen on two ports. The use case, to paraphrase, was that there was a heterogeneous mix of clients, some of which could connect with TLS, some of which couldn’t. They wanted the clients that could use TLS to do so, while allowing the non-TLS clients access.
The simple answer is: Upgrade your non-TLS clients already! But of course the world is a complicated place. And for reasons that weren’t given (but which we will accept for now), it has to be two different ports.
The PostgreSQL server itself can only listen on one port. But there were two options presented that could fix this:
- Run pgbouncer with TLS turned on, on a different port, and have it forward the connections to the PostgreSQL server via a local socket.
- Run stunnel to listen for TLS connections, and route those to PostgreSQL.
I don’t imagine many people will have this exact situation, but if you do… there are options!
2 May 2023
10:23
I’ll be speaking about Writing a Foreign Data Wrapper at PGCon 2023 in Ottawa, May 30-June 2, 2023. Do come! It’s the premiere technical/hacker conference for PostgreSQL.
30 March 2023
01:14
In a comment on my earlier post on max_wal_size
, Lukas Fittl asked a perfectly reasonable question:
Re: “The only thing it costs you is disk space; there’s no other problem with it being too large.”
Doesn’t this omit the fact that a higher max_wal_size
leads to longer recovery times after a crash? In my experience that was the reason why you wouldn’t want max_wal_size
to e.g. be 100GB, since it means your database might take a while to get back up and running after crashes.
The answer is… as you might expect, tricky.
The reason is that there are two different ways a checkpoint can be started in PostgreSQL (in regular operations, that is; there’s a few more, such as manual CHECKPOINT
commands and the start of a backup using pg_start_backup
). Those are when PostgreSQL thinks it needs to checkpoint to avoid overrunning max_wal_size
(by too much), and when checkpoint_timeout
is reached. It starts a checkpoint on the first of those that it hits.
The theory behind my recommendations on checkpoint tuning is to increase max_wal_size
to the point that you are sure that it is always checkpoint_timeout
that fires rather than max_wal_size
. That in effect caps the checkpoint interval, so larger values of max_wal_size
don’t change the checkpoint behavior once it has reached the level that checkpoint_timeout
is always the reason a checkpoint starts.
But Lukas does raise a very good point: the time it takes to recover a PostgreSQL system from a crash is proportionate to the amount of WAL that it has to replay, in bytes, and that’s soft-capped by max_wal_size
. If crash recovery speed is a concern, it might make sense to not go crazy with max_wal_size
, and cap it at a lower level.
Pragmatically, crashes are not common and checkpoints are very common, so I recommend optimizing for checkpoint performance rather than recovery time… but if your system is very sensitive to recovery time, going crazy with max_wal_size
is probably not a good idea.
23 March 2023
13:21
The reality is that most PostgreSQL configuration parameters don’t have a huge impact on overall system performance. There are, however, a couple that really can make a huge difference when tuned from the defaults. work_mem
is one of them, and max_wal_size
is another.
max_wal_size
controls how large the write-ahead log can get on disk before PostgreSQL does a checkpoint. It’s not a hard limit; PostgreSQL adapts checkpoint frequency to keep the WAL on disk no larger than that, but excursions above it can definitely happen. The only thing it costs you is disk space; there’s no other problem with it being too large.
Having max_wal_size
too small can cause checkpoints to happen very frequently. Frequent checkpointing is bad for two reasons:
- Checkpoints themselves are expensive, since all of the dirty buffers in
shared_buffers
need to be written out.
- The first time a page is changed after a checkpoint, the entire page is written to the WAL rather than just the change. On a busy system, this can be a very significant burst of WAL activity.
Here’s a process to set max_wal_size
properly:
First, set the general checkpoint parameters. This is a good start:
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_compression = on
log_checkpoints = on
max_wal_size = 16GB
Then, let the system run, and check the logs (or any other tools you may have to determine checkpoint frequency). If the checkpoints are happening more frequently than every 15 minutes, increase max_wal_size
until they are being triggered by the timeout.
How about min_wal_size
? This controls the amount of reserved WAL files that PostgreSQL will retain on disk even if it doesn’t need it for other reasons. This can speed up the WAL slightly, since PostgreSQL can use one of those retained files instead of having to create a new one. There’s no harm in bumping it up (again, all it costs is disk space), but on nearly every environment, the performance impact is small.
21 March 2023
07:10
The slides from my presentation “Real-World Logical Replication” are now available.
06:48
The slides are now available for my talk “Database Antipatterns, and where to find them” at SCaLE 20x.
13 March 2023
13:27
If you google around for how to set work_mem
in PostgreSQL, you’ll probably find something like:
To set work_mem, take the number of connections, add 32, divide by your astrological sign expressed as a number (Aquarius is 1), convert it to base 7, and then read that number in decimal megabytes.
So, I am here to tell you that every formula setting work_mem
is wrong. Every. Single. One. They may not be badly wrong, but they are at best first cuts and approximations.
The problem is that of all the parameters you can set in PostgreSQL, work_mem
is about the most workload dependent. You are trying to balance two competing things:
- First, you want to set it high enough that PostgreSQL does as many of the operations as it can (generally, sorts and sort-adjacent operations) in memory rather than on secondary storage, since it’s much faster to do them in memory, but:
- You want it to be low enough that you don’t run out of memory while you are doing these things, because the query will then get canceled unexpectedly and, you know, people talk.
You can prevent the second situation with a formula. For example, you can use something like:
50% of free memory + file system buffers divided by the number of connections.
The chance of running out of memory using that formula is very low. It’s not zero, because a single query can use more than work_mem
if there are multiple execution nodes demanding it in a query, but that’s very unlikely. It’s even less likely that every connection will be running a query that has multiple execution nodes that require full work_mem
; the system will have almost certainly melted down well before that.
The problem with using a formula like that is that you are, to mix metaphors, leaving RAM on the table. For example, on a 48GB server with max_connections = 1000
, you end up with with a work_mem
in the 30MB range. That means that a query that needs 64MB, even if it is the only one on the system that needs that much memory, will be spilled to disk while there’s a ton of memory sitting around available.
So, here’s what you do:
- Use a formula like that to set
work_mem
, and then run the system under a realistic production load with log_temp_files = 0
set.
- If everything works fine and you see no problems and performance is 100% acceptable, you’re done.
- If not, go into the logs and look for temporary file creation messages. They look something like this:
2023-03-13 13:19:03.863 PDT,,,45466,,640f8503.b19a,1,,2023-03-13 13:18:11 PDT,6/28390,0,LOG,00000,"temporary file: path ""base/pgsql_tmp/pgsql_tmp45466.0"", size 399482880",,,,,,"explain analyze select f from t order by f;",,,"psql","parallel worker",44989,0
- If there aren’t any, you’re done, the performance issue isn’t temporary file creation.
- If there are, the setting for
work_mem
to get rid of them is 2 times the largest temporary file (temporary files have less overhead than memory operations).
Of course, that might come up with something really absurd, like 2TB. Unless you know for sure that only one query like that might be running at a time (and you really do have enough freeable memory), you might have to make some decisions about performance vs memory usage. It can be very handy to run the logs throughs through an analyzer like pgbadger
to see what the high water mark is for temporary file usage at any one time.
If you absolutely must use a formula (for example, you are deploying a very large fleet of servers with varying workloads and instance sizes and you have to put something in the Terraform script), we’ve had good success with:
(average freeable memory * 4) / max_connections
But like every formula, that’s at best an approximation. If you want an accurate number that maximizes performance without causing out-of-memory issues, you have to gather data and analyze it.
Sorry for any inconvenience.
3 March 2023
08:35
I’m currently scheduled to speak at:
I hope to see you at one of these!
28 February 2023
20:07
Over the course of the last few versions, PostgreSQL has introduces all kinds of background worker processes, including workers to do various kinds of things in parallel. There are enough now that it’s getting kind of confusing. Let’s sort them all out.
You can think of each setting as creating a pool of potential workers. Each setting draws its workers from a “parent” pool. We can visualize this as a Venn diagram:
max_worker_processes
sets the overall size of the worker process pool. You can never have more than that many background worker processes in the system at once. This only applies to background workers, not the main backend processes that handle connections, or the various background processes (autovacuum daemon, WAL writer, etc.) that PostgreSQL uses for its own operations.
From that pool, you can create up to max_parallel_workers
parallel execution worker processes. These come in two types:
Parallel maintenance workers, that handle parallel activities in index creation and vacuuming. max_parallel_maintenance_workers
sets the maximum number that can exist at one time.
Parallel query workers. These processes are started automatically to parallelize queries. The maximum number here isn’t set directly; instead, it is set by max_parallel_workers_per_gather
. That’s the maximum number of processes that one gather
execute node can start. Usually, there’s only one gather
node per query, but complex queries can use multiple sets of parallel workers (much like a query can have multiple nodes that all use work_mem
).
So, what shall we set these to?
Background workers that are not parallel workers are not common in PostgreSQL at the moment, with one notable exception: logical replication workers. The maximum number of these are set by the parameter max_logical_replication_workers
. What to set that parameter to is a subject for another post. I recommend starting the tuning with max_parallel_workers
, since that’s going to be the majority of worker processes going at any one time. A good starting value is 2-3 times the number of cores in the server running PostgreSQL. If there are a lot of cores (32 to 64 or more), 1.5 times might be more appropriate.
For max_worker_processes
, a good place to start is to sum:
max_parallel_workers
max_logical_replication_workers
- And an additional 4-8 extra background worker slots.
Then, consider max_parallel_workers_per_gather
. If you routinely processes large result sets, increasing it from the default of 2 to 4-6 is reasonable. Don’t go crazy here; a query rapidly reaches a point of diminishing returns in spinning up new parallel workers.
For max_parallel_maintenance_workers
, 4-6 is also a good value. Go with 6 if you have a lot of cores, 4 if you have more than eight cores, and 2 otherwise.
Remember that every worker in parallel query execution can individually consume up to work_mem
in working memory. Set that appropriately for the total number of workers that might be running at any one time. Note that it’s not just work_mem
x max_parallel_workers_per_gather
! Each individual worker can use more than work_mem
if it has multiple operations that require it, and any non-parallel queries can do so as well.
Finally, max_parallel_workers
, max_parallel_maintenance_workers
, and max_parallel_workers_per_gather
can be set for an individual session (or role, etc.), so if you are going to run an operation that will benefit from a large number of parallel workers, you can increase it for just that query. Note that the overall pool is still limited by max_worker_processes
, and changing that requires a server restart.