UUIDs vs serials for keys
16 February 2023
This topic pops up very frequently: “Should we use
bigints as primary keys?”
One of the reasons that the question gets so many conflicting answers is that there are really two different questions being asked:
- “Should our keys be random or sequential?”
- “Should our keys be 64 bits, or larger?”
Let’s take them independently.
Should our keys be random or sequential?
There are strong reasons for either one. The case for random keys is:
They’re more-or-less self-hashing, if the randomness is truly random. This means that if an outside party sees that you have a customer number 109248310948109, they can’t rely on you having a customer number 109248310948110. This can be handy if keys are exposed in URLs or inside of web pages, for example. You can expose
66ee0ea6-dad8-4b0b-af1c-bdc55ccd45eto the world with a pretty high level of confidence you haven’t given an attacker useful information.
It’s much easier to merge databases or tables together if the keys are random (and highly unlikely to collide) than if the keys are serials starting at 1.
The case for sequential keys is:
Sequential keys are (sometimes much) faster to generate than random keys.
Sequential keys have much better interaction with B-tree indexes than random keys, since inserting a new key doesn’t have to consult as many pages as it does in a random key. Different tests have come up with different results on how big the performance difference is, but random keys are always going to be slower than sequential ones in this case. (Note, however, that the tests almost always compare
UUID, and that’s conflating both the sequential vs random and 64-bit vs 128-bit properties.)
As we note below, “sequential” doesn’t automatically mean
bigint! There are implementations of UUIDs (or, at least, 128-bit UUID-like values) that have high order sequential bits but low order random bits. This avoids the index locality problems of purely random keys, while preserving (to an extent) the self-hashing behavior of random keys.
Should our keys be 64 bits, or larger?
It’s often just taken for granted than when we say “random” keys, we mean “
UUIDs”, but there’s nothing intrinsic about
bigint keys that means they have to be sequential, or (as we noted above) about
UUID keys that require they be purely random.
bigint values will be more performant in PostgreSQL than 128 bit values. Of course, one reason is just that PostgreSQL has to move twice as much data (and store twice as much data on disk). A more subtle reason is the internal storage model PostgreSQL uses for values. The
Datum type that represents a single value is the “natural” word length of the processor (64 bits on a 64 bit processor). If the value fits in 64 bits, the
Datum is just the value. If it’s larger than 64 bits, the
Datum is a pointer to the value. Since
UUIDs are 128 bits, this adds a level of indirection and memory management to handling one internally. How big is this performance issue? Not large, but it’s not zero, either.
So, if you don’t think you need 128 bits of randomness (really, 124 bits plus a type field) that a
UUID provides, consider using a 64 bit value even if it is random, or if it is (for example) 16 bits of sequence plus 48 bits of randomness.
Other considerations about sequential keys
If you are particularly concerned about exposing information, one consideration is that keys that have sequential properties, even just in the high bits, can expose the rate of growth of a table and the total size of it. This may be something you don’t want run the risk of leaking; a new social media network probably doesn’t want the outside world keeping close track of the size of the
user table. Purely random keys avoid this, and may be a good choice if the key is exposed to the public in an API or URL. Limiting the number of high-order sequential bits can also mitigate this, and a (probably small) cost in locality for B-tree indexes.
There are 3 comments.
Tony Marston at 03:04, 17 February 2023:
When you mention “exposing information” are you implying that you expose primary key values in URLs? If so that is a big no-no which should be avoided at all costs.
CaptainKirkFan at 08:50, 22 February 2023:
You Allude to UUID7 and ULID::UUID being usable.
But don’t actually reference them.
I’ve recently tested this with others, and after a few million rows of data, the performance of regular UUIDs degrades so fast…
It is really not fair to suggest that ANY SIGNIFICANT part of the performance is the size. Most systems are smart enough to compare the High Order, and break if different (no need to compare the low order, unless you matched the high order).
And much larger (NUMERIC) keys can be used to prove they don’t suffer on INSERTS like UUIDs do after the first few million. Not even after the first Billion.
But, they are slower than BIGINTS. (About 50%).
The key thing is that time difference on a non-random key is constant.
There is also overhead for the randomness…
We create this function to give us a usec most significant portion, obviating the need for timestamp sorting. The other thing you give up with UUIDs is that you can’t meaningfully sort by the key.
CREATE OR REPLACE FUNCTION generate_ulid()
RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) * (1000000)::numeric)))::bigint), 14, ‘0’::text) || encode(gen_random_bytes(9), ‘hex’::text)))::uuid
Miffed at 01:36, 23 February 2023:
> This can be handy if keys are exposed in URLs or inside of web pages, for example. You can expose 66ee0ea6-dad8-4b0b-af1c-bdc55ccd45e to the world with a pretty high level of confidence you haven’t given an attacker useful information.
Tom, I don’t think he’s implying it, I think he’s stating it verbatim.
It is so infinitely infuriating when someone just says “don’t do that”.
Why shouldn’t I? What is the alternative? Care to elaborate?