30 October 2024
07:50
(A quick interlude in our series on locales and character encodings in PostgreSQL.)
C/POSIX locale is only useful if you know that you will never have characters in your strings other than 7-bit ASCII, or you are 100% sure that you are never going to have mixed client encodings. If you do have non-7-bit-ASCII characters with mixed client encodings, zaniness can result:
c=# \l
List of databases
Name | Owner | Encoding | Locale Provider | Collate | Ctype | Locale | ICU Rules | Access privileges
-----------+------------------+----------+-----------------+-------------+-------------+-----------+-----------+-------------------
c | xof | UTF8 | libc | POSIX | POSIX | | |
(17 rows [others deleted])
c=# set client_encoding='UTF8';
SET
c=# CREATE TABLE m (v varchar);
CREATE TABLE
c=# INSERT INTO m VALUES('Í');
INSERT 0 1
c=# set client_encoding='LATIN1';
SET
c=# INSERT INTO m VALUES('é');
INSERT 0 1
c=# SELECT * FROM m;
v
----
?
é
(2 rows)
c=# set client_encoding='UTF8';
SET
c=# SELECT * FROM m;
v
----
Í
é
(2 rows)
c=# INSERT INTO m VALUES('‰');
INSERT 0 1
c=# SELECT * FROM m;
v
----
Í
é
‰
(3 rows)
c=# set client_encoding='LATIN1';
SET
c=# SELECT * FROM m;
ERROR: character with byte sequence 0xe2 0x80 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN1"
Even worse (in this regard) than C locale is the SQL_ASCII encoding. Please never use it.
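To see concretely why mixed client encodings are so dangerous, here is a small Python sketch (illustrative, not part of the psql session above): two clients hand over bytes in different encodings, and there is no uniform way to decode them afterwards.

```python
# One client sends UTF-8 bytes, another sends LATIN1 bytes.
utf8_bytes = "Í".encode("utf-8")      # b'\xc3\x8d': two bytes
latin1_bytes = "é".encode("latin-1")  # b'\xe9': one byte

# A C/POSIX-locale database stores both without complaint.
stored = [utf8_bytes, latin1_bytes]

# A UTF-8 client reads them back; the LATIN1 bytes are not valid UTF-8.
for raw in stored:
    try:
        print(raw.decode("utf-8"))
    except UnicodeDecodeError:
        print("?")  # roughly the kind of garbage psql showed above
```

Once bytes in two encodings are interleaved in one column, no single `client_encoding` setting can display all of them correctly.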
27 October 2024
03:18
This is the second installment in our discussion of locales, character encodings, and collations in PostgreSQL. In this installment, we’ll talk about character encodings as they relate to PostgreSQL.
A quick reminder!
A character encoding is a mapping between code points (that is, numbers) and glyphs (what we programmers usually call characters). There are lots, and lots, and lots of different character encodings, most of them some superset of good old 7-bit ASCII.
PostgreSQL and Encodings
From the point of view of the computer, a character string is just a sequence of bytes (maybe terminated by a zero byte, maybe with a length). If the only thing PostgreSQL had to do with character strings was store them and return them to the client, it could just ignore that character encodings even exist.
However, databases don’t just store character strings: they also compare them, build indexes on them, change them to upper case, do regex searches on them, and other things that mean they need to know what the characters are and how to manipulate them. So, PostgreSQL needs to know what character encoding applies to the text it is storing.
You specify the character encoding for a database when creating the database. (Each database in a PostgreSQL instance can have a different encoding.) That encoding applies to all text stored in the database, and can’t be changed once the database is created: your only option to change encodings is to use pg_dump
(or a similar copy-based system, like logical replication) and transfer the data to a new database with the new encoding. This makes the choice of what encoding to pick up front very important.
This is stupid and I don’t want to think about it.
If you search the web, you’ll see a lot of advice that you should just “ignore locales” in PostgreSQL and just use “C” locale. This advice usually dates from around when PostgreSQL first introduced proper locale support, and there was a certain amount of culture shock in having to deal with these issues.
This is terrible advice. If you are processing human-readable text, you need to think about these topics. Buckle up.
The good news is that there is a correct answer to the question, “What encoding should I use?”
Use UTF-8
The character encoding decision is an easy one: Use UTF-8. Although one can argue endlessly over whether or not Unicode is the “perfect” way of encoding characters, we almost certainly will not get a better one in our lifetime, and it has become the de facto (and in many cases, such as JSON, the de jure) standard for text on modern systems.
The only exception is if you are building a highly specialized database that will only ever accept text in a particular encoding. The chance of this is extremely small, and even then, there is a strong argument for converting the text to UTF-8 instead.
But what about C encoding?
There is a downside to using UTF-8: It’s slower than basic “C” encoding. (“C” encoding, as a reminder, just takes the binary strings as it finds them and compares them that way, without regard to what natural language code points they represent.) The difference is significant: on a quick test of sorting 1,000,000 strings, each 64 characters and ASCII-only, UTF-8 encoding with the en_UTF8 collation was almost 18x slower than C encoding and collation.
There are, however, some significant problems:
The database will just accept any string of bytes that it gets as text, which means that if different clients are sending in strings with different encodings (which happens very frequently), you’ll end up with a database with mixed encodings that will send the wrong bytes back to the client. The first time someone stores any character that isn’t 7-bit ASCII, the probability of mayhem goes up considerably, and it’s very hard to fix this once it happens.
Strings will sort in an order that doesn’t match any natural language, which can be very surprising to a human reading the output.
Unless sorting a large number of character strings on a regular basis is a bottleneck, or you know that you will never have strings that correspond to natural language, the performance improvement is not going to be worth the inevitable issues. Just use UTF-8 encoding.
(There are some reasonable alternatives we will discuss later, with their own set of trade-offs.)
Coming up next…
Having decided that we are going to use UTF-8 as the character encoding, which collation should we use? PostgreSQL has an embarrassingly large number of options here, and version 17 introduced some new ones!
25 October 2024
03:50
If you are not familiar with the quote.
This is part one of a series on PostgreSQL and collations, and how to use them without tears. This is an introduction to the general concepts of glyphs, character encodings, collations, and locales.
Glyphs
There is (as always in things involving real human behavior) some controversy over what is a glyph and what isn’t, but as a first approximation, we can use this definition:
These are all glyphs:
A z 4 ü
Symbols are also glyphs:
^ @ ! ‰
Some things are both:
Π
Capital Π (pi) is both a mathematical symbol and a letter of the Greek alphabet. Of course, most Greek letters are also mathematical symbols.
Glyphs also include characters from other languages that are not alphabetic:
這些都是繁體中文的字形
ᎣᏍᏓ ᏑᎾᎴᎢ
And glyphs include symbols that aren’t traditional language components at all:
🏈 🐓 🔠 🛑
Strictly speaking, there is a distinction between a “glyph” and a “grapheme.” The “glyph” is the mark itself, which can be different depending on typeface, weight, etc. The “grapheme” is the underlying functional unit that the glyph represents. In this definition, the letter a rendered in several different typefaces is all the same “grapheme” (lower-case Latin a) but different glyphs.
However, everything involving characters in computing calls the fundamental units “glyphs” rather than “graphemes,” so that’s the term you’ll see here.
A glyph corresponds to what most programmers intuitively think of as a character, and the terms are more or less interchangeable, with some exceptions that we’ll talk about in future articles.
Character Encodings
We need computers to be able to understand glyphs. That means we need to turn glyphs into numbers. (A number that represents a glyph is that glyph’s code point.)
For that, we have a character encoding. A character encoding is a bidirectional mapping between a set of glyphs and a set of numbers. For programmers, probably the most familiar character encoding is ASCII, which maps a (very) limited set of 95 printable glyphs to 7-bit numbers. (There are encodings even older than ASCII.) Some of ASCII’s greatest hits include:
A ↔ 65 (0x41)
a ↔ 97 (0x61)
* ↔ 42 (0x2A)
ASCII has been around for a long time (it achieved its modern form in 1967). It has some nice features: if an ASCII code represents an upper-case letter, it can be turned into the matching lower-case letter with code + 0x20, and back to upper-case with code - 0x20. If an ASCII code is a numeric glyph, it can be turned into the actual numeric equivalent with code - 0x30.
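These arithmetic tricks can be sketched in a few lines of Python (an illustration, not from the original post; the function names are mine):

```python
# ASCII case and digit conversion by simple arithmetic.
def ascii_lower(c: str) -> str:
    # Upper-case A-Z to lower-case: add 0x20.
    return chr(ord(c) + 0x20) if "A" <= c <= "Z" else c

def ascii_upper(c: str) -> str:
    # Lower-case a-z to upper-case: subtract 0x20.
    return chr(ord(c) - 0x20) if "a" <= c <= "z" else c

def ascii_digit(c: str) -> int:
    # Digit glyph to its numeric value: subtract 0x30.
    return ord(c) - 0x30

print(ascii_lower("A"), ascii_upper("q"), ascii_digit("7"))  # a Q 7
```

These shortcuts are exactly the ones that break down the moment the text is in some encoding other than ASCII, which is a theme we will return to.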
(You’ll sometimes hear the phrase “7-bit ASCII.” This is, strictly speaking, redundant. ASCII doesn’t define any code points above 127. Everything that uses the range 128-255 is an extension to ASCII. “7-bit ASCII” is useful, though, when you want to specify “ASCII with no extensions.”)
ASCII has huge limitations: It was designed to represent the characters that appeared on a typical English-language computer terminal in 1967, which means it lacks glyphs for the vast majority of languages in the world (even English! ï, for example, is used in English, as in naïve, as is é, as in café).
It became instantly obvious that ASCII wasn’t going to cut it as a worldwide character encoding. Different languages started developing their own character encodings, some based on ASCII codes greater than 127, some on newly-invented character encodings that took more than a single 8-bit byte to represent. Chaos reigned.
Unicode
In an attempt to bring some kind of order to character encodings, an effort began in the late 1980s to create a universal encoding that would include all glyphs from all languages. This was just as ambitious as it sounds! In 1991, the Unicode Consortium was formed, and the first version of Unicode was published in October 1991.
Unicode code points are numbers in the range 0 to 0x10FFFF (21 bits, though they are usually handled as 32-bit values), organized into “code planes.” The “Basic Multilingual Plane,” which contains most glyphs for most living languages, has the top bits equal to zero, but Unicode code points can be, and often are in real life, greater than 65,535.
So, is Unicode a “character encoding”? Well, yes and no. It maps glyphs to code points and back again, so it qualifies as a character encoding in that sense. But Unicode deliberately does not specify how the particular code points are to be stored inside the computer.
To find out how the code points are actually stored inside the computer, we need to talk about Unicode Transformation Formats.
UTF-8, UTF-16, etc.
A Unicode Transformation Format (UTF) is a way of encoding a 32-bit Unicode code point. (Yes, we’re encoding an encoding. Computers are great!)
The simplest UTF is UTF-32: we just take four bytes for each Unicode code point, and that’s that. It’s simple and doesn’t require anything special to decode (well, almost), but it means we are taking four bytes for every glyph. Given that a large percentage of the time, computers are dealing with ASCII characters, we’ve just made our character strings four times larger. No one wants that.
So, UTF-8 and UTF-16 were invented, and were one of the cleverest inventions ever. The rules for both are:
- Consider the Unicode code point as a 32-bit number.
- If the code point is 127 or less (the upper 25 bits are zero), UTF-8 encodes it as a single byte; if it is 65,535 or less (the upper 16 bits are zero), UTF-16 encodes it as a single 16-bit unit.
- Otherwise, the code point is broken into a sequence of one-byte (UTF-8) or two-byte (UTF-16) units according to some clever rules.
One of the very nice features of this design is that any code point where the upper 25 bits are zero is encoded in UTF-8 as exactly the ASCII byte for the same glyph, so text encoded in ASCII is also valid UTF-8 with no other processing.
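A quick Python illustration of these properties (not from the original post):

```python
# How many bytes UTF-8 uses per code point.
for s in ["A", "é", "€", "🏈"]:
    print(f"U+{ord(s):05X} -> {len(s.encode('utf-8'))} byte(s)")

# ASCII text is already valid UTF-8, byte for byte:
assert "hello".encode("ascii") == "hello".encode("utf-8")

# Code points above the Basic Multilingual Plane exist and are common...
assert ord("🏈") > 0xFFFF
# ...and need two 16-bit units (a surrogate pair) in UTF-16:
assert len("🏈".encode("utf-16-be")) == 4
```

So plain ASCII costs one byte per character, most European accented letters cost two, and emoji cost four.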
UTF-8 is the overwhelming favorite for encoding Unicode.
One downside of UTF-8 is that a “character” in the traditional sense is no longer fixed length, so you can’t just count bytes to tell how many characters are in a string. Programming languages have struggled with this for years, although the situation seems to be finally calming down.
In order to process UTF-8 properly, the code must know that it’s getting UTF-8. If you have ever seen ugly things like ‚Äù on a web page where a “ should be, you’ve seen what happens when UTF-8 isn’t interpreted properly.
Collations
Computers have to compare character strings. From this really simple requirement, no end of pain has flowed.
A collation is just a function that takes two strings (that is, ordered sequences of glyphs represented as code points), and says whether the first is less than, equal to, or greater than the second:
f(string1, string2) → (<, =, >)
The problem, as always, is human beings. Different languages have different rules for what order strings should appear in, even if exactly the same glyphs are used. Although there have been some heroic attempts to define collation rules that span languages, languages and the way they are written tend to be very emotional issues tied up with feelings of culture and national pride (“a language is a dialect with an army and a navy”).
There are a lot of collations. Collations by their nature are associated with a particular character encoding, but one character encoding can have many different collations. For example, just on my laptop, UTF-8 has 53 different collations.
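As a toy illustration in Python (str.lower here is a stand-in for a real collation, which is vastly more sophisticated): the same strings under two different comparison functions give two different orders.

```python
words = ["apple", "Zebra", "banana"]

# "C"-style collation: raw code point order, so all upper-case
# letters sort before all lower-case letters.
print(sorted(words))                 # ['Zebra', 'apple', 'banana']

# A crude language-ish collation: compare case-insensitively.
print(sorted(words, key=str.lower))  # ['apple', 'banana', 'Zebra']
```

Neither order is “wrong”; they are simply different collations of the same code points, which is exactly why a database has to be told which one to use.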
Locale
On POSIX-compatible systems (which include Linux and *BSD), collations are one part of a bundle called a locale. A locale includes several different settings and utility functions; they typically are:
- Number format setting
- Character encoding
- Some related utility functions (such as how to convert between cases in the encoding).
- Date-time format setting
- The string collation to use
- Currency format setting
- Paper size setting
Locales on POSIX systems have names that look like:
fr_BE.UTF-8
fr defines the language (French), BE defines the “territory” (Belgium, so we are talking about Belgian French), and UTF-8 is the character encoding. From this, we can determine that the character strings are encoded as UTF-8, and the collation rules are the rules for Belgian French.
If there’s only one character encoding available for a particular combination of language and territory, you’ll sometimes see the locale written without the encoding, like fr_BE.
There is one locale (with two different names) that is a big exception to the rules, and an important one:
C or POSIX
The C locale (also called POSIX) uses rules defined by the C/C++ language standard. This means that the system doesn’t have an opinion about what encoding the strings are in, and it’s up to the programmer to keep track of them. (This is really true in any locale, but in the C locale, the system doesn’t even provide a hint as to what encoding to use or expect.) The collation function just compares strings byte-wise; this works great on ASCII, and while it happens to match code-point order on UTF-8, it matches no natural language’s ordering, and strings a human would consider equal (different Unicode normalization forms, say) compare as unequal. For functions, like case conversion, that need to know a character encoding, the C locale uses the rules for ASCII, which are completely and utterly wrong for most other encodings.
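A quick Python sketch (illustrative only) of what byte-wise, ASCII-rules processing does to UTF-8 text:

```python
# Operating on the raw UTF-8 bytes, the way a C-locale system would.
s = "café".encode("utf-8")   # b'caf\xc3\xa9'

# ASCII-rules upper-casing touches only the ASCII letters;
# the two bytes of 'é' are left alone.
print(s.upper().decode("utf-8"))   # CAFé, not CAFÉ

# Byte-wise ordering puts every multi-byte character after all of ASCII:
assert b"z" < "é".encode("utf-8")  # 'é' sorts after 'z', not near 'e'
```

In most natural-language collations é sorts with e, well before z; byte-wise comparison puts it after the entire ASCII range.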
PostgreSQL also provides a C.UTF-8 locale, which is a special case on a special case. We’ll talk about that in a future installment.
Locale Provider
Locales are really just bundles of code and data, so they need to come from somewhere so the system can use them. The code libraries that contain locales are locale providers. There are two important ones (plus one other that’s relevant to PostgreSQL only).
(g)libc
libc is the library that implements the standard C library (and a lot of other stuff) on POSIX systems. It’s where 99% of programs on POSIX systems get locale information from. Different distributions of Linux and *BSD systems have different sets of locales that are provided as part of the base system, and others that can be installed in optional packages. You’ll usually see libc as glibc, which is the Free Software Foundation’s GNU Project’s version of libc.
ICU
The International Components for Unicode is a set of libraries from the Unicode Consortium to provide utilities for handling Unicode. ICU includes a large number of functions, including a large set of locales. One of the most important tools provided by ICU is an implementation of the Unicode Collation Algorithm, which provides a basic collation algorithm usable across multiple languages.
(Full disclosure: The work that ultimately became the ICU was started at Taligent, where I was the networking architect. It’s without a doubt the most useful thing Taligent ever produced.)
The PostgreSQL Built-In Provider
In version 17, PostgreSQL introduced a built-in locale provider. We’ll talk about that in a future installment.
What’s Next?
In the next installment, we’ll talk about how PostgreSQL uses all of this, and why it has resulted in some really terrible situations.
10 July 2024
10:48
I turned the last blog post into a talk; you can get the slides here.
25 March 2024
10:00
The PostgreSQL roles and privileges system can be full of surprises.
Let’s say we have a database test, owned by user owner. In it, we create a very secret function f that we do not want just anyone to be able to execute:
test=> select current_user;
current_user
--------------
owner
(1 row)
test=> CREATE FUNCTION f() RETURNS int as $$ SELECT 1; $$ LANGUAGE sql;
CREATE FUNCTION
test=> select f();
f
---
1
(1 row)
There are two other users: hipriv and lowpriv. We want hipriv to be able to run the function, but not lowpriv. So, we grant EXECUTE to hipriv, but revoke it from lowpriv:
test=> GRANT EXECUTE ON FUNCTION f() TO hipriv;
GRANT
test=> REVOKE EXECUTE ON FUNCTION f() FROM lowpriv;
REVOKE
Let’s test it! We log in as hipriv and run the function:
test=> SELECT current_user;
current_user
--------------
hipriv
(1 row)
test=> SELECT f();
f
---
1
(1 row)
Works great. Now, let’s try it as lowpriv:
test=> SELECT current_user;
current_user
--------------
lowpriv
(1 row)
test=> SELECT f();
f
---
1
(1 row)
Wait, what? Why did it let lowpriv run f()? We explicitly revoked that permission! Is the PostgreSQL privileges system totally broken?
Well, no. But there are some surprises.
Let’s look at the privileges on f():
test=> SELECT proacl FROM pg_proc where proname = 'f';
proacl
-----------------------------------------
{=X/owner,owner=X/owner,hipriv=X/owner}
(1 row)
Each entry has the form “grantee=privileges/grantor”. We see that owner has X (that is, EXECUTE) on f() granted by itself, and hipriv has EXECUTE granted by owner. But what’s with that first one that doesn’t have a role at the start? And where is our REVOKE on lowpriv?
The first thing that may be surprising is that there is no such thing as a REVOKE entry in the privileges. REVOKE removes a privilege that already exists; it doesn’t create a new entry that says “don’t allow this.” This means that unless there is already an entry that matches the REVOKE, REVOKE is a no-op.
The second thing is that if no role is specified before the =, that means the special role PUBLIC. PUBLIC means “all roles.” So, anyone can execute f()! This is the default privilege for new functions.
Combined, this means that when the function was created, EXECUTE was granted to PUBLIC. The REVOKE was a no-op, because there was no explicit grant of privileges to lowpriv.
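As an illustration of that format, here is a hypothetical Python helper (the function name and the simplified parsing are mine, not PostgreSQL’s; real aclitem values can also contain quoting and grant-option markers that this ignores) that decodes the proacl value shown above:

```python
def parse_acl(acl: str):
    """Parse a simple proacl value like '{=X/owner,hipriv=X/owner}'.

    Each entry is grantee=privileges/grantor; an empty grantee means PUBLIC.
    Illustrative only: ignores quoting and grant options ('*').
    """
    entries = []
    for item in acl.strip("{}").split(","):
        grantee_and_privs, grantor = item.split("/")
        grantee, privs = grantee_and_privs.split("=")
        entries.append((grantee or "PUBLIC", privs, grantor))
    return entries

print(parse_acl("{=X/owner,owner=X/owner,hipriv=X/owner}"))
# [('PUBLIC', 'X', 'owner'), ('owner', 'X', 'owner'), ('hipriv', 'X', 'owner')]
```

Reading the output, the surprise grant jumps right out: the first entry is EXECUTE for PUBLIC.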
How do we fix it? First, we can revoke that undesirable first grant to PUBLIC:
test=> REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
REVOKE
hipriv can still run the function, because we gave it an explicit grant:
test=> SELECT current_user;
current_user
--------------
hipriv
(1 row)
test=> SELECT f();
f
---
1
(1 row)
But lowpriv can’t skate in under the grant to PUBLIC, so it can’t run the function anymore:
test=> SELECT current_user;
current_user
--------------
lowpriv
(1 row)
test=> SELECT f();
ERROR: permission denied for function f
So, if you are counting on the PostgreSQL privilege system to prevent roles from running functions (and accessing other objects), be sure you know what the default permissions are, and adjust them accordingly.
4 October 2023
08:16
The slides for my talk “Look It Up: Real-Life Database Indexing” are now available.
9 May 2023
10:58
I shouldn’t have to say this, but don’t use ChatGPT for technical advice.
In an experiment, I asked 40 questions about PostgreSQL. 23 came back with misleading or simply inaccurate information. Of those, 9 came back with answers that would have caused (at best) performance issues. One of the answers could result in a corrupted database (deleting WAL files to recover disk space).
LLMs are not a replacement for expertise.
3 May 2023
09:55
Recently on one of the PostgreSQL mailing lists, someone wrote in asking if it was possible to get PostgreSQL to listen on two ports. The use case, to paraphrase, was that there was a heterogeneous mix of clients, some of which could connect with TLS, some of which couldn’t. They wanted the clients that could use TLS to do so, while allowing the non-TLS clients access.
The simple answer is: Upgrade your non-TLS clients already! But of course the world is a complicated place. And for reasons that weren’t given (but which we will accept for now), it has to be two different ports.
The PostgreSQL server itself can only listen on one port. But there were two options presented that could fix this:
- Run pgbouncer with TLS turned on, on a different port, and have it forward the connections to the PostgreSQL server via a local socket.
- Run stunnel to listen for TLS connections, and route those to PostgreSQL.
I don’t imagine many people will have this exact situation, but if you do… there are options!
2 May 2023
10:23
I’ll be speaking about Writing a Foreign Data Wrapper at PGCon 2023 in Ottawa, May 30-June 2, 2023. Do come! It’s the premier technical/hacker conference for PostgreSQL.
30 March 2023
01:14
In a comment on my earlier post on max_wal_size, Lukas Fittl asked a perfectly reasonable question:
Re: “The only thing it costs you is disk space; there’s no other problem with it being too large.”
Doesn’t this omit the fact that a higher max_wal_size leads to longer recovery times after a crash? In my experience that was the reason why you wouldn’t want max_wal_size to e.g. be 100GB, since it means your database might take a while to get back up and running after crashes.
The answer is… as you might expect, tricky.
The reason is that there are two different ways a checkpoint can be started in PostgreSQL (in regular operations, that is; there are a few more, such as manual CHECKPOINT commands and the start of a backup using pg_start_backup). Those are when PostgreSQL thinks it needs to checkpoint to avoid overrunning max_wal_size (by too much), and when checkpoint_timeout is reached. It starts a checkpoint on the first of those that it hits.
The theory behind my recommendations on checkpoint tuning is to increase max_wal_size to the point that you are sure that it is always checkpoint_timeout that fires rather than max_wal_size. That in effect caps the checkpoint interval, so larger values of max_wal_size don’t change the checkpoint behavior once it has reached the level that checkpoint_timeout is always the reason a checkpoint starts.
But Lukas does raise a very good point: the time it takes to recover a PostgreSQL system from a crash is proportionate to the amount of WAL that it has to replay, in bytes, and that’s soft-capped by max_wal_size. If crash recovery speed is a concern, it might make sense to not go crazy with max_wal_size, and cap it at a lower level.
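As a back-of-envelope illustration (the 200 MB/s replay rate below is a made-up assumption; real replay speed depends heavily on hardware and workload):

```python
def worst_case_recovery_seconds(max_wal_size_gb: float,
                                replay_mb_per_s: float) -> float:
    # Crash recovery replays at most (roughly) max_wal_size of WAL,
    # so worst-case recovery time scales linearly with it.
    return max_wal_size_gb * 1024 / replay_mb_per_s

# With max_wal_size = 100GB and a hypothetical 200 MB/s replay rate:
print(worst_case_recovery_seconds(100, 200))  # 512.0 seconds
```

That is the linear trade-off in miniature: double max_wal_size and you double the worst-case replay time after a crash.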
Pragmatically, crashes are not common and checkpoints are very common, so I recommend optimizing for checkpoint performance rather than recovery time… but if your system is very sensitive to recovery time, going crazy with max_wal_size is probably not a good idea.