postgresql when it's not your job

Checking Your Privileges

Xof — Mon, 25 Mar 2024 17:00:00 +0000

The PostgreSQL roles and privileges system can be full of surprises.

Let’s say we have a database test, owned by user owner. In it, we create a very secret function f that we do not want just anyone to be able to execute:

test=> select current_user;
 current_user 
--------------
 owner
(1 row)

test=> CREATE FUNCTION f() RETURNS int as $$ SELECT 1; $$ LANGUAGE sql;
CREATE FUNCTION
test=> select f();
 f 
---
 1
(1 row)

There are two other users: hipriv and lowpriv. We want hipriv to be able to run the function, but not lowpriv. So, we grant EXECUTE to hipriv, but revoke it from lowpriv:

test=> GRANT EXECUTE ON FUNCTION f() TO hipriv;
GRANT
test=> REVOKE EXECUTE ON FUNCTION f() FROM lowpriv;
REVOKE

Let’s test it! We log in as hipriv and run the function:

test=> SELECT current_user;
 current_user 
--------------
 hipriv
(1 row)

test=> SELECT f();
 f 
---
 1
(1 row)

Works great. Now, let’s try it as lowpriv:

test=> SELECT current_user;
 current_user 
--------------
 lowpriv
(1 row)

test=> SELECT f();
 f 
---
 1
(1 row)

Wait, what? Why did it let lowpriv run f()? We explicitly revoked that permission! Is the PostgreSQL privileges system totally broken?

Well, no. But there are some surprises.

Let’s look at the privileges on f():

test=> SELECT proacl FROM pg_proc where proname = 'f';
                 proacl                  
-----------------------------------------
 {=X/owner,owner=X/owner,hipriv=X/owner}
(1 row)

The interpretation of each of the entries is “grantee=privileges/grantor”. We see that owner has X (that is, EXECUTE) on f() granted by itself, and hipriv has EXECUTE granted by owner. But what’s with that first one that doesn’t have a role at the start? And where is our REVOKE on lowpriv?

The first thing that may be surprising is that there is no such thing as a REVOKE entry in the privileges. REVOKE removes a privilege that already exists; it doesn’t create a new entry that says “don’t allow this.” This means that unless there is already an entry that matches the REVOKE, REVOKE is a no-op.

The second thing is that if there is no role specified at the start of an entry, that means the special role PUBLIC. PUBLIC means “all roles.” So, anyone can execute f()! This is the default privilege for new functions.

Combined, this means that when the function was created, EXECUTE was granted to PUBLIC. The REVOKE was a no-op, because there was no explicit grant of privileges to lowpriv.
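To make the entry format concrete, here is a minimal Python sketch that splits a proacl string like the one above into its parts. The function name parse_acl is made up for illustration, and the parsing is deliberately naive: it assumes plain role names with no quoting or embedded commas.

```python
def parse_acl(acl: str):
    """Split a proacl-style string into (grantee, privileges, grantor) tuples.

    Assumption: role names are simple identifiers (no quotes, commas, '=' or '/').
    """
    entries = []
    body = acl.strip().strip("{}")
    for item in body.split(","):
        if not item:
            continue  # empty ACL
        grantee, rest = item.split("=", 1)
        privileges, grantor = rest.split("/", 1)
        # An empty grantee is the special role PUBLIC.
        entries.append((grantee or "PUBLIC", privileges, grantor))
    return entries


for grantee, privileges, grantor in parse_acl("{=X/owner,owner=X/owner,hipriv=X/owner}"):
    print(f"{grantee} has {privileges} granted by {grantor}")
```

Note that lowpriv never appears in the output: there is no entry for the REVOKE, only the grant to PUBLIC that lowpriv inherits.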

How do we fix it? First, we can revoke that undesirable first grant to PUBLIC:

test=> REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
REVOKE

hipriv can still run the function, because we gave it an explicit grant:

test=> SELECT current_user;
 current_user 
--------------
 hipriv
(1 row)

test=> SELECT f();
 f 
---
 1
(1 row)

But lowpriv can’t skate in under the grant to PUBLIC, so it can’t run the function anymore:

test=> SELECT current_user;
 current_user 
--------------
 lowpriv
(1 row)

test=> SELECT f();
ERROR:  permission denied for function f

The next thing we should do is alter the default privileges for new functions so that PUBLIC does not have EXECUTE privilege on them. You can do this with:

test=> ALTER DEFAULT PRIVILEGES FOR USER owner REVOKE EXECUTE ON FUNCTIONS FROM PUBLIC;
ALTER DEFAULT PRIVILEGES

This means any new functions created by the role owner will not have EXECUTE granted to PUBLIC. It’s important to remember that this does not change the privileges of any existing functions, and it only changes it for functions created by owner, not any other user or role.

So, if you are counting on the PostgreSQL privilege system to prevent roles from running functions (and accessing other objects), be sure you know what the default permissions are, and adjust them accordingly.

“Look It Up: Real-Life Database Indexing” at PgConf.NYC

Xof — Wed, 04 Oct 2023 15:16:47 +0000

The slides for my talk “Look It Up: Real-Life Database Indexing” are now available.

Don’t use ChatGPT to solve problems.

Xof — Tue, 09 May 2023 17:58:02 +0000

I shouldn’t have to say this, but don’t use ChatGPT for technical advice.

In an experiment, I asked 40 questions about PostgreSQL. 23 came back with misleading or simply inaccurate information. Of those, 9 came back with answers that would have caused (at best) performance issues. One of the answers could result in a corrupted database (deleting WAL files to recover disk space).

LLMs are not a replacement for expertise.

Running PostgreSQL on two ports

Xof — Wed, 03 May 2023 16:55:29 +0000

Recently on one of the PostgreSQL mailing lists, someone wrote in asking if it was possible to get PostgreSQL to listen on two ports. The use case, to paraphrase, was that there was a heterogeneous mix of clients, some of which could connect with TLS, some of which couldn’t. They wanted the clients that could use TLS to do so, while allowing the non-TLS clients access.

The simple answer is: Upgrade your non-TLS clients already! But of course the world is a complicated place. And for reasons that weren’t given (but which we will accept for now), it has to be two different ports.

The PostgreSQL server itself can only listen on one port. But there were two options presented that could fix this:

Run pgbouncer with TLS turned on, on a different port, and have it forward the connections to the PostgreSQL server via a local socket.
Run stunnel to listen for TLS connections, and route those to PostgreSQL.

I don’t imagine many people will have this exact situation, but if you do… there are options!

“Writing a Foreign Data Wrapper” at PGCon 2023

Xof — Tue, 02 May 2023 17:23:27 +0000

I’ll be speaking about Writing a Foreign Data Wrapper at PGCon 2023 in Ottawa, May 30-June 2, 2023. Do come! It’s the premier technical/hacker conference for PostgreSQL.

A little more on max_wal_size

Xof — Thu, 30 Mar 2023 08:14:41 +0000

In a comment on my earlier post on max_wal_size, Lukas Fittl asked a perfectly reasonable question:

Re: “The only thing it costs you is disk space; there’s no other problem with it being too large.”

Doesn’t this omit the fact that a higher max_wal_size leads to longer recovery times after a crash? In my experience that was the reason why you wouldn’t want max_wal_size to e.g. be 100GB, since it means your database might take a while to get back up and running after crashes.

The answer is… as you might expect, tricky.

The reason is that there are two different ways a checkpoint can be started in PostgreSQL (in regular operations, that is; there are a few more, such as manual CHECKPOINT commands and the start of a backup using pg_start_backup). Those are when PostgreSQL thinks it needs to checkpoint to avoid overrunning max_wal_size (by too much), and when checkpoint_timeout is reached. It starts a checkpoint on the first of those that it hits.

The theory behind my recommendations on checkpoint tuning is to increase max_wal_size to the point that you are sure that it is always checkpoint_timeout that fires rather than max_wal_size. That in effect caps the checkpoint interval, so larger values of max_wal_size don’t change the checkpoint behavior once it has reached the level that checkpoint_timeout is always the reason a checkpoint starts.

But Lukas does raise a very good point: the time it takes to recover a PostgreSQL system from a crash is proportional to the amount of WAL that it has to replay, in bytes, and that’s soft-capped by max_wal_size. If crash recovery speed is a concern, it might make sense to not go crazy with max_wal_size, and cap it at a lower level.

Pragmatically, crashes are not common and checkpoints are very common, so I recommend optimizing for checkpoint performance rather than recovery time… but if your system is very sensitive to recovery time, going crazy with max_wal_size is probably not a good idea.

The importance of max_wal_size

Xof — Thu, 23 Mar 2023 20:21:31 +0000

The reality is that most PostgreSQL configuration parameters don’t have a huge impact on overall system performance. There are, however, a couple that really can make a huge difference when tuned from the defaults. work_mem is one of them, and max_wal_size is another.

max_wal_size controls how large the write-ahead log can get on disk before PostgreSQL does a checkpoint. It’s not a hard limit; PostgreSQL adapts checkpoint frequency to keep the WAL on disk no larger than that, but excursions above it can definitely happen. The only thing it costs you is disk space; there’s no other problem with it being too large.

Having max_wal_size too small can cause checkpoints to happen very frequently. Frequent checkpointing is bad for two reasons:

Checkpoints themselves are expensive, since all of the dirty buffers in shared_buffers need to be written out.
The first time a page is changed after a checkpoint, the entire page is written to the WAL rather than just the change. On a busy system, this can be a very significant burst of WAL activity.

Here’s a process to set max_wal_size properly:

First, set the general checkpoint parameters. This is a good start:

checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_compression = on
log_checkpoints = on
max_wal_size = 16GB

Then, let the system run, and check the logs (or any other tools you may have to determine checkpoint frequency). If the checkpoints are happening more frequently than every 15 minutes, increase max_wal_size until they are being triggered by the timeout.
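If you would rather count than eyeball the logs, something like this Python sketch can tally why checkpoints started. It assumes the log_checkpoints messages take the form “checkpoint starting: time” or “checkpoint starting: wal” (older releases said “xlog” instead of “wal”). The sample lines and their timestamp prefix are made up for illustration, not copied from a real server.

```python
import re
from collections import Counter

# Assumption: the trigger reason(s) follow "checkpoint starting:" at the end of the line.
STARTING = re.compile(r"checkpoint starting: (.+)$")


def checkpoint_triggers(log_lines):
    """Count checkpoints started by the timeout, by WAL volume, or by anything else."""
    counts = Counter()
    for line in log_lines:
        match = STARTING.search(line.rstrip())
        if not match:
            continue
        flags = match.group(1).split()
        if "time" in flags:
            counts["time"] += 1
        elif "wal" in flags or "xlog" in flags:
            counts["wal"] += 1
        else:
            counts["other"] += 1
    return counts


# Illustrative lines only; your log_line_prefix will differ.
sample = [
    "2023-03-23 10:00:00 UTC LOG:  checkpoint starting: time",
    "2023-03-23 10:04:10 UTC LOG:  checkpoint starting: wal",
    "2023-03-23 10:15:00 UTC LOG:  checkpoint starting: time",
]
counts = checkpoint_triggers(sample)
print(dict(counts))
print("consider raising max_wal_size:", counts["wal"] > 0)
```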

How about min_wal_size? This controls the amount of reserved WAL files that PostgreSQL will retain on disk even if it doesn’t need them for other reasons. This can speed up the WAL slightly, since PostgreSQL can use one of those retained files instead of having to create a new one. There’s no harm in bumping it up (again, all it costs is disk space), but in nearly every environment, the performance impact is small.

“Real-World Logical Replication” at Nordic PGDay 2023

Xof — Tue, 21 Mar 2023 14:10:20 +0000

The slides from my presentation “Real-World Logical Replication” are now available.

“Database Antipatterns, and where to find them” at SCaLE 20x

Xof — Tue, 21 Mar 2023 13:48:48 +0000

The slides are now available for my talk “Database Antipatterns, and where to find them” at SCaLE 20x.

Everything you know about setting `work_mem` is wrong.

Xof — Mon, 13 Mar 2023 20:27:35 +0000

If you google around for how to set work_mem in PostgreSQL, you’ll probably find something like:

To set work_mem, take the number of connections, add 32, divide by your astrological sign expressed as a number (Aquarius is 1), convert it to base 7, and then read that number in decimal megabytes.

So, I am here to tell you that every formula setting work_mem is wrong. Every. Single. One. They may not be badly wrong, but they are at best first cuts and approximations.

The problem is that of all the parameters you can set in PostgreSQL, work_mem is about the most workload dependent. You are trying to balance two competing things:

First, you want to set it high enough that PostgreSQL does as many of the operations as it can (generally, sorts and sort-adjacent operations) in memory rather than on secondary storage, since it’s much faster to do them in memory, but:
You want it to be low enough that you don’t run out of memory while you are doing these things, because the query will then get canceled unexpectedly and, you know, people talk.

You can prevent the second situation with a formula. For example, you can use something like:

50% of free memory + file system buffers divided by the number of connections.

The chance of running out of memory using that formula is very low. It’s not zero, because a single query can use more than work_mem if there are multiple execution nodes demanding it in a query, but that’s very unlikely. It’s even less likely that every connection will be running a query that has multiple execution nodes that require full work_mem; the system will have almost certainly melted down well before that.

The problem with using a formula like that is that you are, to mix metaphors, leaving RAM on the table. For example, on a 48GB server with max_connections = 1000, you end up with a work_mem in the 30MB range. That means that a query that needs 64MB, even if it is the only one on the system that needs that much memory, will be spilled to disk while there’s a ton of memory sitting around available.
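As a rough sketch of that arithmetic in Python: the function name work_mem_mb is made up, and the example generously assumes the entire 48GB counts as “free memory + file system buffers”, so the result is an upper bound for that server.

```python
def work_mem_mb(available_mb: float, max_connections: int, fraction: float = 0.5) -> float:
    """The conservative formula: a fraction of (free memory + file system
    buffers), divided evenly across every possible connection."""
    return available_mb * fraction / max_connections


# Assumption: all 48GB is free memory + file system buffers.
available_mb = 48 * 1024
setting = work_mem_mb(available_mb, max_connections=1000)

print(f"work_mem is about {setting:.1f}MB")
print("a 64MB sort spills to disk:", 64 > setting)
```

The setting lands in the tens of megabytes, so the lone 64MB query goes to disk even though nearly all of the memory is idle.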

So, here’s what you do:

Use a formula like that to set work_mem, and then run the system under a realistic production load with log_temp_files = 0 set.
If everything works fine and you see no problems and performance is 100% acceptable, you’re done.
If not, go into the logs and look for temporary file creation messages. They look something like this: 2023-03-13 13:19:03.863 PDT,,,45466,,640f8503.b19a,1,,2023-03-13 13:18:11 PDT,6/28390,0,LOG,00000,"temporary file: path ""base/pgsql_tmp/pgsql_tmp45466.0"", size 399482880",,,,,,"explain analyze select f from t order by f;",,,"psql","parallel worker",44989,0
If there aren’t any, you’re done, the performance issue isn’t temporary file creation.
If there are, the setting for work_mem to get rid of them is 2 times the largest temporary file (temporary files have less overhead than memory operations).
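The last two steps can be sketched in Python. This is a minimal illustration, not a log analyzer: the function name is made up, and the regular expression assumes the “temporary file: path ..., size N” message shape shown in the example line above.

```python
import re

# Assumption: the size in bytes follows ", size " in the temporary file message.
TEMP_FILE = re.compile(r"temporary file: path .*?, size (\d+)")


def suggested_work_mem_bytes(log_lines):
    """Return 2x the largest temporary file seen, or None if there are none."""
    sizes = [int(m.group(1)) for line in log_lines if (m := TEMP_FILE.search(line))]
    return 2 * max(sizes) if sizes else None


# The sample line from the text above.
sample = [
    '2023-03-13 13:19:03.863 PDT,,,45466,,640f8503.b19a,1,,2023-03-13 13:18:11 PDT,'
    '6/28390,0,LOG,00000,"temporary file: path ""base/pgsql_tmp/pgsql_tmp45466.0"", '
    'size 399482880",,,,,,"explain analyze select f from t order by f;",,,"psql",'
    '"parallel worker",44989,0'
]

suggestion = suggested_work_mem_bytes(sample)
print(f"work_mem suggestion: {suggestion / (1024 * 1024):.0f}MB")
```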

Of course, that might come up with something really absurd, like 2TB. Unless you know for sure that only one query like that might be running at a time (and you really do have enough freeable memory), you might have to make some decisions about performance vs memory usage. It can be very handy to run the logs through an analyzer like pgbadger to see what the high water mark is for temporary file usage at any one time.

If you absolutely must use a formula (for example, you are deploying a very large fleet of servers with varying workloads and instance sizes and you have to put something in the Terraform script), we’ve had good success with:

(average freeable memory * 4) / max_connections

But like every formula, that’s at best an approximation. If you want an accurate number that maximizes performance without causing out-of-memory issues, you have to gather data and analyze it.

Sorry for any inconvenience.

Upcoming Live Presentations

Xof — Fri, 03 Mar 2023 15:35:40 +0000

I’m currently scheduled to speak at:

SCaLE 20x in Pasadena, California, on Database Antipatterns and How to Find Them, March 9.
Nordic PgDay 2023 in Stockholm, on Real-World Logical Replication, March 21.
PgDay/MED 2023 in Malta, on Extreme PostgreSQL, April 13.

I hope to see you at one of these!

Workers of the World, Unite!

Xof — Wed, 01 Mar 2023 03:07:48 +0000

Over the course of the last few versions, PostgreSQL has introduced all kinds of background worker processes, including workers to do various kinds of things in parallel. There are enough now that it’s getting kind of confusing. Let’s sort them all out.

You can think of each setting as creating a pool of potential workers. Each setting draws its workers from a “parent” pool. We can visualize this as a Venn diagram:

max_worker_processes sets the overall size of the worker process pool. You can never have more than that many background worker processes in the system at once. This only applies to background workers, not the main backend processes that handle connections, or the various background processes (autovacuum daemon, WAL writer, etc.) that PostgreSQL uses for its own operations.

From that pool, you can create up to max_parallel_workers parallel execution worker processes. These come in two types:

Parallel maintenance workers, that handle parallel activities in index creation and vacuuming. max_parallel_maintenance_workers sets the maximum number that can exist at one time.
Parallel query workers. These processes are started automatically to parallelize queries. The maximum number here isn’t set directly; instead, it is set by max_parallel_workers_per_gather. That’s the maximum number of processes that one gather execute node can start. Usually, there’s only one gather node per query, but complex queries can use multiple sets of parallel workers (much like a query can have multiple nodes that all use work_mem).

So, what shall we set these to?

Background workers that are not parallel workers are not common in PostgreSQL at the moment, with one notable exception: logical replication workers. The maximum number of these is set by the parameter max_logical_replication_workers. What to set that parameter to is a subject for another post. I recommend starting the tuning with max_parallel_workers, since that’s going to be the majority of worker processes going at any one time. A good starting value is 2-3 times the number of cores in the server running PostgreSQL. If there are a lot of cores (32 to 64 or more), 1.5 times might be more appropriate.

For max_worker_processes, a good place to start is to sum:

max_parallel_workers
max_logical_replication_workers
And an additional 4-8 extra background worker slots.

Then, consider max_parallel_workers_per_gather. If you routinely process large result sets, increasing it from the default of 2 to 4-6 is reasonable. Don’t go crazy here; a query rapidly reaches a point of diminishing returns in spinning up new parallel workers.

For max_parallel_maintenance_workers, 4-6 is also a good value. Go with 6 if you have a lot of cores, 4 if you have more than eight cores, and 2 otherwise.
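Those rules of thumb can be written down as a small Python sketch. Everything here is a starting point under stated assumptions: the function name is made up, “a lot of cores” is taken to mean 32 or more, the low end of the 2-3x range and the high end of the 4-8 extra slots are used, and max_logical_replication_workers defaults to PostgreSQL’s default of 4.

```python
def suggest_worker_settings(cores: int,
                            max_logical_replication_workers: int = 4,
                            extra_slots: int = 8,
                            many_cores: int = 32) -> dict:
    """Starting values for the worker pool settings, following the rules of thumb above."""
    # 2x cores normally; 1.5x once the machine has a lot of cores.
    multiplier = 1.5 if cores >= many_cores else 2
    max_parallel_workers = int(cores * multiplier)

    if cores >= many_cores:
        maintenance = 6
    elif cores > 8:
        maintenance = 4
    else:
        maintenance = 2

    return {
        "max_parallel_workers": max_parallel_workers,
        # The overall pool: parallel workers + logical replication workers + spare slots.
        "max_worker_processes": max_parallel_workers
                                + max_logical_replication_workers
                                + extra_slots,
        "max_parallel_maintenance_workers": maintenance,
    }


for cores in (4, 16, 64):
    print(cores, suggest_worker_settings(cores))
```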

Remember that every worker in parallel query execution can individually consume up to work_mem in working memory. Set that appropriately for the total number of workers that might be running at any one time. Note that it’s not just work_mem x max_parallel_workers_per_gather! Each individual worker can use more than work_mem if it has multiple operations that require it, and any non-parallel queries can do so as well.

Finally, max_parallel_workers, max_parallel_maintenance_workers, and max_parallel_workers_per_gather can be set for an individual session (or role, etc.), so if you are going to run an operation that will benefit from a large number of parallel workers, you can increase it for just that query. Note that the overall pool is still limited by max_worker_processes, and changing that requires a server restart.

ALTER TABLE … SET WITHOUT OIDS big gotcha

Xof — Wed, 22 Feb 2023 01:37:52 +0000

Normally, when you drop a column from PostgreSQL, it doesn’t have to do anything to the data in the table. It just marks the column as no longer alive in the system catalogs, and gets on with business.

There is, however, a big exception to this: ALTER TABLE … SET WITHOUT OIDS. This pops up when using pg_upgrade to upgrade a database to a version of PostgreSQL that doesn’t support table OIDs (if you don’t know what and why user tables in PostgreSQL had OIDs, that’s a topic for a different time).

ALTER TABLE … SET WITHOUT OIDS rewrites the whole table, and reindexes the table as well. This can take up quite a bit of secondary storage space:

On the tablespace that the current table lives in, it can take up to the size of the table as it rewrites the table.
On temporary file storage (pgsql_tmp), it can take significant storage doing the reindexing, since it may need to spill the required sorts to disk. This can be mitigated by increasing maintenance_work_mem.

So, plan for some extended table locking if you do this. If you have a very large database to upgrade, and it still has tables with OIDs, this may be an opportunity to upgrade via logical replication rather than pg_upgrade.

UUIDs vs serials for keys

Xof — Fri, 17 Feb 2023 04:34:42 +0000

This topic pops up very frequently: “Should we use UUIDs or bigints as primary keys?”

One of the reasons that the question gets so many conflicting answers is that there are really two different questions being asked:

“Should our keys be random or sequential?”
“Should our keys be 64 bits, or larger?”

Let’s take them independently.

Should our keys be random or sequential?

There are strong reasons for either one. The case for random keys is:

They’re more-or-less self-hashing, if the randomness is truly random. This means that if an outside party sees that you have a customer number 109248310948109, they can’t rely on you having a customer number 109248310948110. This can be handy if keys are exposed in URLs or inside of web pages, for example. You can expose 66ee0ea6-dad8-4b0b-af1c-bdc55ccd45e to the world with a pretty high level of confidence you haven’t given an attacker useful information.
It’s much easier to merge databases or tables together if the keys are random (and highly unlikely to collide) than if the keys are serials starting at 1.

The case for sequential keys is:

Sequential keys are (sometimes much) faster to generate than random keys.
Sequential keys have much better interaction with B-tree indexes than random keys, since inserting a new key doesn’t have to consult as many pages as it does in a random key. Different tests have come up with different results on how big the performance difference is, but random keys are always going to be slower than sequential ones in this case. (Note, however, that the tests almost always compare bigint to UUID, and that’s conflating both the sequential vs random and 64-bit vs 128-bit properties.)
As we note below, “sequential” doesn’t automatically mean bigint! There are implementations of UUIDs (or, at least, 128-bit UUID-like values) that have high order sequential bits but low order random bits. This avoids the index locality problems of purely random keys, while preserving (to an extent) the self-hashing behavior of random keys.

Should our keys be 64 bits, or larger?

It’s often just taken for granted that when we say “random” keys, we mean “UUIDs”, but there’s nothing intrinsic about bigint keys that means they have to be sequential, or (as we noted above) about UUID keys that require they be purely random.

bigint values will be more performant in PostgreSQL than 128 bit values. Of course, one reason is just that PostgreSQL has to move twice as much data (and store twice as much data on disk). A more subtle reason is the internal storage model PostgreSQL uses for values. The Datum type that represents a single value is the “natural” word length of the processor (64 bits on a 64 bit processor). If the value fits in 64 bits, the Datum is just the value. If it’s larger than 64 bits, the Datum is a pointer to the value. Since UUIDs are 128 bits, this adds a level of indirection and memory management to handling one internally. How big is this performance issue? Not large, but it’s not zero, either.

So, if you don’t think you need the 128 bits (really, 122 random bits plus version and variant fields, for a random UUID) that a UUID provides, consider using a 64 bit value even if it is random, or if it is (for example) 16 bits of sequence plus 48 bits of randomness.
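Here is one way the “16 bits of sequence plus 48 bits of randomness” layout could look, as a Python sketch. The function names are made up, and where the sequence value comes from (a counter, a coarse timestamp) is left as an assumption. Because PostgreSQL’s bigint is signed, the second helper maps the unsigned value into the signed 64-bit range.

```python
import secrets


def make_key(sequence: int) -> int:
    """Build a 64-bit key: high 16 bits from a sequence, low 48 bits random."""
    high = sequence & 0xFFFF        # keep only 16 bits of the sequence
    low = secrets.randbits(48)      # cryptographically strong random bits
    return (high << 48) | low


def to_signed_bigint(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - (1 << 64) if value >= (1 << 63) else value


key = make_key(sequence=5)
print(hex(key), to_signed_bigint(key))
```

One trade-off of the signed conversion: once the top bit of the sequence is set, the stored value goes negative, so ordering by the stored bigint no longer matches ordering by sequence.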

Other considerations about sequential keys

If you are particularly concerned about exposing information, one consideration is that keys that have sequential properties, even just in the high bits, can expose the rate of growth of a table and the total size of it. This may be something you don’t want to run the risk of leaking; a new social media network probably doesn’t want the outside world keeping close track of the size of the user table. Purely random keys avoid this, and may be a good choice if the key is exposed to the public in an API or URL. Limiting the number of high-order sequential bits can also mitigate this, at a (probably small) cost in locality for B-tree indexes.

“Database Patterns and How to Find Them” at SCaLE 2023

Xof — Thu, 16 Feb 2023 05:47:58 +0000

I’ll be speaking on Database Antipatterns and How to Find Them at SCaLE 2023, March 9-12, 2023 in Pasadena, CA.

“Extreme PostgreSQL” at PgDay/MED

Xof — Thu, 16 Feb 2023 01:30:47 +0000

I’m very happy that I’ll be presenting “Extreme PostgreSQL” at PgDay/MED in Malta (yay, Malta!) on 13 April 2023.

Xtreme PostgreSQL!

Xof — Wed, 08 Feb 2023 23:52:27 +0000

The slides from my talk at the February 2023 SFPUG Meeting are now available.

Nordic PgDay 2023

Xof — Mon, 30 Jan 2023 07:00:04 +0000

I’m very pleased to be talking about real-life logical replication at Nordic PgDay 2023, in beautiful Stockholm.

A foreign key pathology to avoid

Xof — Wed, 18 Jan 2023 07:00:59 +0000

There’s a particular anti-pattern in database design that PostgreSQL handles… not very well.

For example, let’s say you are building something like Twitch. (The real Twitch doesn’t work this way! At least, not as far as I know!) So, you have streams, and you have users, and users watch streams. So, let’s do a schema!

CREATE TABLE stream (stream_id bigint PRIMARY KEY);

CREATE TABLE "user" (user_id bigint PRIMARY KEY);

CREATE TABLE stream_viewer (    
    stream_id bigint REFERENCES stream(stream_id),    
    user_id bigint REFERENCES "user"(user_id),   
    PRIMARY KEY (stream_id, user_id));

OK, schema complete! Not bad for a day’s work. (Note the double quotes around "user". USER is a keyword in PostgreSQL, so we have to put it in double quotes to use as a table name. This is not great practice, but more about double quotes some other time.)

Let’s say we persuade a very popular streamer over to our platform. They go on-line, and all 1,252,136 of our users simultaneously log on and start following that stream.

So, we now have to insert 1,252,136 new records into stream_viewer. That’s pretty bad. But what’s worse is now we have 1,252,136 records with a foreign key relationship to a single record in stream. During the operation of the INSERT statement, the transaction that is doing the INSERT will take a FOR KEY SHARE lock on that record. This means that at any one moment, several thousand different transactions will have a FOR KEY SHARE lock on that record.

This is very bad.

If more than one transaction at a time has a lock on a single record, the MultiXact system handles this. MultiXact puts a special transaction ID in the record that’s locked, and then builds an external data structure that holds all of the transaction IDs that have locked the record. This works great… up to a certain size. But that data structure is of fixed size, and when it fills up, it spills onto secondary storage.

As you might imagine, that’s slow. You can see this with lots of sessions suddenly waiting on various MultiXact* lightweight locks.

You can get around this in a few ways:

Don’t have that foreign key. Of course, you also then lose referential integrity, so if the stream record is deleted, there may still be lots of stream_viewer records that now have an invalid foreign key.
Batch up the join operations. That way, one big transaction is doing the INSERTs instead of a large number of small ones. This can make a big difference in both locking behavior, and general system throughput. (For extra credit, use a COPY instead of an INSERT to process the batch.)

Not many systems have this particular design issue. (You would never actually build a streaming site using that schema, just to start.) But if you do, this particular behavior is a good thing to avoid.

OK, sometimes you can lock tables.

Xof — Mon, 16 Jan 2023 07:00:41 +0000

Previously, I wrote that you should never lock tables. And you usually shouldn’t! But sometimes, there’s a good reason to. Here’s one.

When you are doing a schema-modifying operation, like adding a column to a table, PostgreSQL needs to take an ACCESS EXCLUSIVE lock on the table while it is modifying the system catalogs. Unless it needs to rewrite the table, this lock isn’t held for very long.

However, locks in PostgreSQL are first-come, first-served. If the system is busy, there may be conflicting locks on the table that you are attempting to modify. (Even just a SELECT statement takes a lock on the tables it is operating on; it just doesn’t conflict with much.) If the ALTER TABLE statement can’t get the lock right away, it enters a queue, waiting to get to the front and get the lock.

However, now, every lock after that enters the queue, too, behind that ALTER TABLE. This can create the result of a long-running ACCESS EXCLUSIVE lock, even though it’s not granted. On a busy table on a busy system, this can shut things down.

So, what to do?

You can do this:

DO $$
   BEGIN
   FOR i IN 1 .. 1000 LOOP
      BEGIN
         LOCK TABLE t NOWAIT;
         ALTER TABLE t ADD COLUMN i BIGINT;
         RETURN;
      EXCEPTION WHEN lock_not_available THEN
         PERFORM pg_sleep(1);
         CONTINUE;
      END;
   END LOOP;
   RAISE lock_not_available;
   END;
$$;

This loops until it can acquire the lock, but doesn’t sit in the queue if it can’t. Once it acquires the lock, it does the modification and exits. If it can’t acquire the lock after a certain number of cycles, it exits with an error (you can set the number of cycles to anything, and you can adjust the time it sleeps after failing to get the lock).

How slow is DECIMAL, anyway?

Xof — Sun, 15 Jan 2023 07:00:10 +0000

In PostgreSQL, NUMERIC is a variable length type of fixed precision. You can have as many digits as you want (and are willing to pay the storage for). DOUBLE PRECISION is a floating point type, with variable precision.
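As an analogy only (this is Python, not PostgreSQL, and Python’s decimal module is not how NUMERIC is implemented), the difference between an exact decimal type and a binary floating point type shows up like this:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2

# A decimal type stores the digits as written, so the sum is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print("float:  ", float_sum == 0.3)
print("decimal:", decimal_sum == Decimal("0.3"))
```

That exactness is what you are paying for when NUMERIC turns out slower.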

Sometimes, the question comes up: How much slower is NUMERIC than DOUBLE PRECISION, anyway?

Here’s a quick, highly unscientific benchmark:

Doing a simple test (100 million rows), a straight SUM() across a NUMERIC was about 2.2x slower than DOUBLE PRECISION. It went up to about 4x slower if there was a simple calculation, SUM(n*12). It was about 5x slower if the calculation involved the same type, SUM(n*n). Of course, these are just on my laptop, but I would expect that the ratios would remain constant on other machines.

Inserting the 100 million rows took 72.2 seconds for DOUBLE PRECISION, 146.2 seconds for NUMERIC. The resulting table size was 3.5GB for DOUBLE PRECISION, 4.2GB for NUMERIC.

So, yes, NUMERIC is slower. But it’s not absurdly slower. NUMERIC is much slower than bigint (exercise left to the reader), so using NUMERIC for things like primary keys is definitely not a good idea.

Why submit a paper to PgDay 2020?

Xof — Fri, 15 Nov 2019 17:00:01 +0000

If you have something interesting to say about PostgreSQL, we would love to get a proposal from you. Even if you have never spoken before, consider responding to the CfP! PgDay 2020 is particularly friendly to first-time and inexperienced speakers. You’re among friends! If you use PostgreSQL, you almost certainly have opinions and experiences that others would love to hear about… go for it!

PgDaySF 2020!

Xof — Wed, 13 Nov 2019 20:26:24 +0000

The very first PgDay San Francisco is coming to the Swedish-American Hall on January 21, 2020. It’s going to be an amazing event.

If you have something to say about PostgreSQL…

… the Call for Proposals is now open through November 22, 2019. We are looking for 40 minute talks about anything related to PostgreSQL. First-time speakers are particularly encouraged to send in proposals.

If you are interested in or use PostgreSQL…

… Early-Bird Tickets are now available! Attendance is limited, so be sure to get your seat now.

If your company uses PostgreSQL…

… consider sponsoring the event! We can’t do it without our sponsors, and it is a great way to recruit PostgreSQL people. Show off your company to the PostgreSQL community!

“Look It Up: Practical PostgreSQL Indexing” at Nordic PGDay 2019

Xof — Wed, 20 Mar 2019 13:34:56 +0000

The slides from my presentation at PGDay Nordic 2019 are now available.

What’s up with SET TRANSACTION SNAPSHOT?

Xof — Mon, 11 Feb 2019 22:44:33 +0000

A feature of PostgreSQL that most people don’t even know exists is the ability to export and import transaction snapshots.

The documentation is accurate, but it doesn’t really describe why one might want to do such a thing.

First, what is a “snapshot”? You can think of a snapshot as the current set of committed tuples in the database, a consistent view of the database. When you start a transaction and set it to REPEATABLE READ mode, the snapshot remains consistent throughout the transaction, even if other sessions commit transactions. (In the default transaction mode, READ COMMITTED, each statement starts a new snapshot, so newly committed work could appear between statements within the transaction.)

However, each snapshot is local to a single transaction. But suppose you wanted to write a tool that connected to the database in multiple sessions, and did analysis or extraction? Since each session has its own transaction, and the transactions start asynchronously from each other, they could have different views of the database depending on what other transactions got committed. This might generate inconsistent or invalid results.

This isn’t theoretical: Suppose you are writing a tool like pg_dump, with a parallel dump facility. If different sessions got different views of the database, the resulting dump would be inconsistent, which would make it useless as a backup tool!

The good news is that we have the ability to “synchronize” various sessions so that they all use the same base snapshot.

First, a transaction opens and sets itself to REPEATABLE READ or SERIALIZABLE mode (there’s no point in doing exported snapshots in READ COMMITTED mode, since the snapshot will get replaced at the very next statement). Then, that session calls pg_export_snapshot. This creates an identifier for the current transaction snapshot.

Then, the client running the first session passes that identifier to the clients that will be using it. You’ll need to do this via some non-database channel. For example, you can’t use LISTEN / NOTIFY, since the message isn’t actually sent until COMMIT time.

Each client that receives the snapshot ID can then do SET TRANSACTION SNAPSHOT ... to use the snapshot. The client needs to call this before it does any work in the session (even SELECT). Now, each of the clients has the same view into the database, and that view will remain until it COMMITs or ABORTs.

Note that each transaction is still fully autonomous; the various sessions are not “inside” the same transaction. They can’t see each other’s work, and if two different clients modify the database, those modifications are not visible to any other session, including the ones that are sharing the snapshot. You can think of the snapshot as the “base” view of the database, but each session can modify it (subject, of course, to the usual rules involved in modifying the same tuples, or getting serialization failures).
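To make the order of operations concrete, here is a Python sketch that only builds the SQL each session would send; it doesn’t connect to anything. The function names are made up, and the snapshot identifier in the example is a placeholder: the real one is whatever pg_export_snapshot() returns in the exporting session.

```python
def exporter_statements():
    """Statements for the session that creates and exports the snapshot."""
    return [
        "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;",
        "SELECT pg_export_snapshot();",
    ]


def importer_statements(snapshot_id: str):
    """Statements for each session that adopts the exported snapshot.

    SET TRANSACTION SNAPSHOT has to come before any other work in the transaction.
    """
    quoted = "'" + snapshot_id.replace("'", "''") + "'"  # SQL string literal
    return [
        "BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;",
        f"SET TRANSACTION SNAPSHOT {quoted};",
    ]


# Placeholder identifier, passed between clients over some non-database channel.
snapshot_id = "00000003-0000001B-1"

for statement in exporter_statements() + importer_statements(snapshot_id):
    print(statement)
```

The exporting session has to keep its transaction open until every importing session has run its SET TRANSACTION SNAPSHOT.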

This is a pretty specialized use-case, of course; not many applications need to have multiple sessions with a consistent view of the database. But if you do, PostgreSQL has the facilities to do it!