15 November 2024
12:31
In this installment of our series on locales and PostgreSQL, we’ll talk about what can happen with the library that provides locales changes out from under a PostgreSQL database. It’s not pretty.
First, more about collations
Back in an earlier installment, we described a collation as a function that takes two strings, and returns if the strings are greater than, equal to, or less than each other:
coll(a, b) -> {greater than, equal, less than}
This is a slight oversimplification. What the actual collation function does is take a string, and returns a sort key:
coll(a) -> Sa
The sort keys have the property that if the strings they are derived from have (within the rules of the collation) a particular relationship, the sort keys do as well. This is very valuable, because it allows doing these comparisons repeatedly without having to rerun the (potentially expensive) collation function.
In the case of C locale, the collation function is the identity function: you compare exactly the same string of bytes that is present in the input. In more sophisticated locales, the collation function “normalizes” the input string into a string that can be repeatedly compared by just comparing the bytes.
What is an index, anyway?
Everyone who works with databases has an intuitive idea of what an index is, but for our purposes, an index is a cache of a partiuclar sortation of the input column. You can think of an index defined as:
sql
CREATE INDEX on t(a) WHERE b = 1;
as a cache of the query predicate:
sql
SELECT ... WHERE b = 1 ORDER BY a;
This allows the database engine to quickly find rows that use that predicate, as well as (often) retrieve a set of rows in sorted order without having to resort them.
The database automatically updates the “cache” when a row changes in a way that would alter the index. This generally works very, very well.
But:
Cache invalidation is hard.
Suppose that the collation function changes, but no one tells the database engine that. Now, the “cache” in the form of the index is invalid, since it holds the sort keys from the old collation function, but any new entries in the index will use the new sortation function.
This has happened.
Version 2.28 of glibc included many signficiant collation changes. There are many situations in which a PostgreSQL database whose indexes where built on an earlier version might find itself running on the newer version:
- Upgrading the OS (for example, on Ubuntu, from a version before 18.10 to that version or later).
- Creating a binary replica on a newer version of the OS with the new glibc.
- Restoring from a PITR backup taken on the older version to a system running the newer version.
The result is suddenly, indexes aren’t working the way they should be: The results from queries on them that use text indexes can be wrong. This is particularly perplexing in the binary replica case, since queries on the system running the older version behave correctly, while ones on the newer versions are incorrect.
ICU has the same problem, although its collation functions change less often, so switching to ICU collations does not completely solve the problem.
Starting with version 10 (for ICU) and version 13 (for libc), PostgreSQL records the current version of the libraries being used as locale providers, and will issue a warning if the current version on the system does not match the version that was used previously:
WARNING: collation "xx-x-icu" has version mismatch
DETAIL: The collation in the database was created using version 1.2.3.4, but the operating system provides version 2.3.4.5.
HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."xx-x-icu" REFRESH VERSION, or build PostgreSQL with the right library version.
This is a very good reason to monitor log output for warnings, since this is the only place that you might see this message.
It’s possible to get false positives here. A change to the library does not necessarily mean that the collation changed. However, it’s always a good idea to be cautious and rebuild objects that depend on changed collations. The PostgreSQL documentation provides instructions on how to do so. The PostgreSQL Wiki also has a good entry about this problem.
In the next installment, we’ll cover a subject near and dear to every database person’s heart: speed.
14 November 2024
17:57
In our this installment about character encodings, locations, and locales in PostgreSQL, we’re talking about locale providers.
A reminder about locales
A “locale” is a bundled combination of data structures and code that provides support services for handling different localization services. For our purposes, the two most important things that a locale provides are:
- A character encoding.
- A collation.
And as a quick reminder, a character encoding is a mapping between a sequence of one or more bytes and a “glyph” (what we generally think of as a charcter, like A
or *
or Γ
or π€·
). A collation is a function that places strings of characters into a total ordering.
UTF-8
Everything here assumes you are using UTF-8 encoding in your database. If you are not, please review why you should.
Getting locale
A locale provider is a library (in the computing sense, a hunk of code and data) whose API provides a set of locales and lets you use the functions in them. There are three interesting locale providers in the PostgreSQL world:
- libc β This is the implementation of the
C
standard library on every POSIX-compatible system, and many others. You’ll most often see glibc, which is the GNU project’s implemenatation of libc.
- ICU β The Inernational Components for Unicode. This is a (very) extensive library for working with Unicode text. Among other features, it has a comprehensive locale system.
- builtin β As of version 17, PostgreSQL has a builtin collation provider, based on locales built into PostgreSQL itself.
libc
When nearly everyone thinks of locales, they think of libc
. Really, the mostly think of the ubiquitous en_US.UTF-8
locale. If you do not specify a locale when creating a new PostgreSQL instance using initdb
, it uses the system locale, which very often en_US.UTF-8
. (Or C
locale, which is almost never the right choice.)
The structure of a glibc locale name is (using en_US.UTF-8
as an example):
en
β The base language of the locale, usually the ISO 639 code.
- US β Is the “region.” The “region” is used if there are subtypes of locales within the language.
US
means US (“American”) English.
UTF-8
βΒ Is the character encoding.
- There’s also a (rarely seen in practice) optional “modifier” that can be postpended, staring with an
@
. For example, @euro
means to use the Euro currency symbol. (Not all modifiers apply to all locales.)
The exact set of locales available on a particular system depend on the particular flavor of *NIX system, the distribution, and what packages have been installed. The command locale -a
will list what’s currently installed, and locale
will show the various settings of the current locale.
Different platforms can have different collations for the same locale when using libc, and the collation rules for a particular locale can change between libc releases. (We’ll talk about what happens here in a future installment.)
ICU
ICU is an extremely powerful system for dealing with text in a wide variety of languages, and has a head-spinning number of options and configurations. The set of possible ICU collations is enormous, and this just scratches the surface.
In PostgreSQL, ICU locales can be used by creating one in the database using the CREATE COLLATION
command. An example is:
CREATE COLLATION english (provider = icu, locale = 'en-US');
The basic structure of ICU collation names is simliar to that of libc collations: a language code, followed by an optional region (in ICU vernacular, these two components are a language tag). In addition to the language tag, locale names can take key/value pairs of options, introduced by a -u-
extension in the locale name. An example from the PostgreSQL documentation is:
CREATE COLLATION mycollation5 (provider = icu, deterministic = false, locale = 'en-US-u-kn-ks-level2');
Taking the components of the name piece by piece:
en
— English.
US
— United States English.
u
— Introduces Unicode Consortium-specific options.
kn
— Sort numbers as a single numeric value rather than as a sequence of digit characters. ‘123’ will sort after 45
, for example. There is no “value” part, as kn
is a true/false flag.
ks-level2
— Sets the “sensitivity” of the collation to level2
. You can think of the “level” as how much the collation “cares” about differences between characters. level2
means that the collation doesn’t “care” about upper-case vs lower case characters, and will thus do case-insensitive sortation.
ICU also defines a language code of und
meaning (“undefined”). To distinguish libc-style collations from ICU-style locations when only a language tag is given, -x-icu
is added to the end of the name.
ICU is a library like libc, and like libc, the collation algorithms can change between releases (although, in practice, this happens less frequently than it does with libc).
builtin
As of PostgreSQL 17, PostgreSQL has a built-in locale provider, using the idiosyncratic name of builtin
. The builtin provider, right now, defines exactly two locales:
C
— This is identical to the C
locale provided by libc, but is implemented within the PostgreSQL core.
C.UTF-8
— This sorts based on the numeric Unicode codepoint rather than just treating the text as a sequence of bytes (which ignores encodings). This produces results that, while often not correct for any language except US English (pretty much), are deterministic, and performs very well. Needless to say, this is only available for databases in UTF-8
encoding.
So, why is there a builtin provider at all? Doesn’t libc and ICU cover the space completely. Well, yes, with one very important exception, which we’ll talk about in our next installment.
27 October 2024
03:18
This is the second installment in our discussion of locales, character encodings, and collations in PostgreSQL. In this installment, we’ll talk about character encodings as they relate to PostgreSQL.
A quick reminder!
A character encoding is a mapping between code points (that is, numbers) and glyphs (what us programmers usually call characters). There are lots, and lots, and lots of different character encodings, most of them some superset of good old 7-bit ASCII.
PostgreSQL and Encodings
From the point of view of the computer, a character string is just a sequence of bytes (maybe terminated by a zero byte, maybe with a length). If the only thing PostgreSQL had to do with character strings was store them and return them to the client, it could just ignore that character encodings even exist.
However, databases don’t just store character strings: they also compare them, build indexes on them, change them to upper case, do regex searches on them, and other things that mean they need know what the characters are and how to manipulate them. So, PostgreSQL needs to know what character encoding applies to the text it is storing.
You specify the character encoding for a database when creating the database. (Each database in a PostgreSQL instance can have a different encoding.) That encoding applies to all text stored in the database, and can’t be changed once the database is created: your only option to change encodings is to use pg_dump
(or a similar copy-based system, like logical replication) and transfer the data to a new database with the new encoding. This makes the choice of what encoding to pick up front very important.
This is stupid and I don’t want to think about it.
If you search the web, you’ll see a lot of advice that you should just “ignore locales” in PostgreSQL and just use “C” locale. This advice usually dates from around when PostgreSQL first introduced proper locale support, and there was a certain amount of culture shock in having to deal with these issues.
This is terrible advice. If you are processing human-readable text, you need to think about these topics. Buckle up.
The good news is that there is a correct answer to the question, “What encoding should I use?”
Use UTF-8
The character encoding decision is an easy one: Use UTF-8. Although one can argue endlessly over whether or not Unicode is the “perfect” way of encoding characters, we almost certainly will not get a better one in our lifetime, and it has become the de facto (and in many cases, such as JSON, the de jure) standard for text on modern systems.
The only exception is that if you are building a highly specialized database that will only ever accept text in a particular encoding. The chance of this is extremely small, an even then, there is a strong argument for converting the text to UTF-8 instead.
But what about C encoding?
There is a downside to using UTF-8: It’s slower than basic “C” encoding. (“C” encoding, as a reminder, just takes the binary strings as it finds them and compares them that way, without regard to what natural language code points they represent.) The difference is significant: on a quick test of sorting 1,000,000 strings, each 64 characters and ASCII-only, UTF-8 encoding with the en_UTF8
collation was almost 18x times slower than C encoding and collation.
There are, however, some significant problems:
The database will just accept any string of bytes that it gets as text, which means that if different clients are sending in different strings with different encodings (which happens very frequently), you’ll end up with a database with mixed encodings that will send the wrong encodings back to the client. The first time someone stores any character that it’s 7-bit ASCII, the probability for mayhem goes up considerably, and it’s very hard to fix this once it happens.
You will get sortation in a way that doesn’t match any natural language, and can be very surprising to a human reading the output.
Unless sorting a large number of character strings on a regular basis is a bottleneck, or you know that you will never have strings that correspond to natural language, the performance improvement is not going to be worth the inevitable issues. Just use UTF-8 coding.
(There are some reasonable alternatives we will discuss later, with their own set of trade-offs.)
Coming up nextβ¦
Having decided that we are going to use UTF-8 as the character encoding, which collation should we use? PostgreSQL has an embarrassingly large number of options here, and version 17 introduced some new ones!
25 October 2024
03:50
If you are not familiar with the quote.
This is part one of a series on PostgreSQL and collations, and how to use them without tears. This is an introduction to the general concepts of glyphs, character encodings, collations, and locales.
Glyphs
There is (as always in things involving real human behavior) some controversy over what is a gylph and what isn’t, but as a first approximation, we can use this definition:
These are all glyphs:
A z 4 ΓΌ
Symbols are also glyphs:
^ @ ! β°
Some things are both:
β
Capital Ο is both a mathematic symbol and a letter of the Greek alphabet. Of course, most Greek letters are also mathematical symbols.
Glyphs also include characters from other languages that are not alphabetic:
ιδΊι½ζ―ηΉι«δΈζηεε½’
α£αα ααΎα΄α’
And glyphs include symbols that aren’t traditional language components at all:
π π π π
Strictly speaking, there is a distinction between a “glyph” and a “grapheme.” The “glyph” is the mark itself, which can be different depending on typeface, weight, etc. The “grapheme” is the underlying functional unit that the glyph represents. In this definition, these are all the same “grapheme” (lower-case Latin a) but different glyphs:
However, everything involving characters in computing calls the fundamental units “glyphs” rather than “graphemes,” so that’s the term you’ll see here.
A glyph corresponds to what most programmers intuitively think of as a character, and the terms are more or less interchangeable, with some exceptions that we’ll talk about in future articles.
Character Encodings
We need computers to be able to understand glyphs. That means we need to turn glyphs into numbers. (A number that represents a glyph is that glyph’s code point.)
For that, we have a character encoding. A character encoding is a bidrectional mapping between a set of glyphs and a set of numbers. For programmers, probably the most familiar character encoding is ASCII, which maps a (very) limited set of 95 glyphs to 7-bit numbers. (There are encodings even older than ASCII.) Some greatest ASCII hits include:
A β 65 (0x41)
a β 97 (0x61)
* β 42 (0x2A)
ASCII has been around for a long time (it achieved its modern form in 1967). It has some nice features: if an ASCII code represents an upper-case letter, it can be turned into the matching lower-case letter with code + 0x20
, and back to upper-case with code - 0x20
. If an ASCII code is a numeric glyph, it can be turned into the actual numeric equivalent with code - 0x30
.
(You’ll sometimes hear the phrase “7-bit ASCII.” This is, strictly speaking, redundant. ASCII doesn’t define any code points above 127. Everything that uses the range 128-255 is an extension to ASCII. “7-bit ASCII” is useful, though, when you want to specify “ASCII with no extensions.”)
ASCII has huge limitations: It was designed to represent the characters that appeared on a typical English-language computer terminal in 1967, which means it lacks glyphs for the vast majority of languages in the world (even English! Γ―
, for example, is very commonly used in English, as is Γ©
).
It became instantly obvious that ASCII wasn’t going to cut it as a worldwide character encoding. Diffferent languages started developing their own character encodings, some based on ASCII codes greater than 127, some on newly-invented character encodings that took more than a single 8-bit byte to represent. Chaos reigned.
Unicode
In an attempt to bring some kind of order to character encodings, an effort began in the late 1980s to create a universal encoding that would include all glyphs from all languages. This was just as ambitious as it sounds! In 1991, the Unicode Consortium was formed, and the first version of Unicode was published in October 1991.
Unicode code points are 32 bit numbers, organized into “code planes.” The “Basic Multilingual Plane,” which contains most glyphs for most living languages, has the top 16 bits equal to zero, but Unicode code points can be, and often are in real life, greater than 65,535.
So, is Unicode a “character encoding”? Well, yes and no. It maps glyphs to code points and back again, so it qualifies as a character encoding in that sense. But Unicode deliberately does not specify how the particular code points are to be stored inside the computer.
To find out how the code points are actually stored inside the computer, we need to talk about Unicode Transformation Formats.
UTF-8, UTF-16, etc.
A Unicode Transformation Format (UTF) is a way of encoding a 32-bit Unicode code point. (Yes, we’re encoding an encoding. Computers are great!)
The simplest UTF is UTF-32: we just take four bytes for each Unicode code point, and that’s that. It’s simple and doesn’t require anything special to decode (well, almost), but it means we are taking four bytes for every glyph. Given that a large percentage of the time, computers are dealing with ASCII characters, we’ve just made our character strings four times larger. No one wants that.
So, UTF-8 and UTF-16 were invented, and were one of the cleverest inventions ever. The rules for both are:
- Consider the Unicode code point as a 32 bit number.
- If the upper 25 bits (for UTF-8) or 17 bits (for UTF-16) are zero, the encoding is just the lower 7 bits (for UTF-8) or 15 bits (for UTF), encoding as one or two 8-bit bytes.
- If any of the upper 25/17 bits are not zero, the code point is broken into multiple 1-byte (UTF-8) or 2-byte (UTF-16) sequences based on some clever rules,
One of the very nice features of Unicode is that any code point where the upper 25 bits are zero is exactly the same as the ASCII code point for the same glyph, so text encoded in ASCII is also encoded in UTF-8 with no other processing.
UTF-8 is the overwhelming favorite for encoding Unicode.
One downside of UTF-8 is that a “character” in the traditional sense is no longer fixed length, so you can’t just count bytes to tell how many characters are in a string. Programming languages have struggled with this for years, although the situation seems to be finally calming down.
In order to process UTF-8 property, the code must know that it’s getting UTF-8. If you have ever seen ugly things like βΓΓΉ
on a web page where a β
should be, you’ve seen what happens when UTF-8 isn’t interpreted properly.
Collations
Computers have to compare character strings. From this really simple requirement, no end of pain has flowed.
A collation is just a function that takes two strings (that is, ordered sequences of glyphs represented as code points), and says whether or not they are less than, equal to, or greater than each other:
β f(string1, string2) β (<, =, >)
The problem, as always, is human beings. Different languages have different rules for what order strings should appear in, even if exactly the same glyphs are used. Although there have been some heroic attempts to define collation rules that span languages, languages and they way they are written tend to be very emotional issues tied up with feelings of culture and national pride (βa language is a dialect with an army and a navyβ).
There are a lot of collations. Collations by their nature are associated with a particular character encoding, but one character encoding can have many different collations. For example, just on my laptop, UTF-8 has 53 different collations.
Locale
On POSIX-compatible systems (which include Linux and *BSD), collations are one part of a bundle called a locale. A locale includes several different utility functions; they typically are:
- Number format setting
- Character encoding
- Some related utility functions (such as how to convert between cases in the encoding).
- Date-time format setting
- The string collation to use
- Currency format setting
- Paper size setting
Locales on POSIX systems have names that look like:
β fr_BE.UTF-8
fr
defines the langauge (French), BE
defines the “territory” (Belgium, so we are talking about Belgian French), and UTF-8
is the character encoding. From this, we can determine that the character strings are encoded as UTF-8, and the collation rules are the rules for Belgian French.
If there’s only one character encoding available for a particular combination of language and territory, you’ll sometimes see the locale written without the encoding, like fr_BE
.
There is one locale (with two different names) that is a big exception to the rules, and an important one:
C or POSIX
The C
locale (also called POSIX
) uses rules defined by the C/C++ language standard. This means that the system doesn’t have an opinion about what encoding the strings are in, and it’s up to the programmer to keep track of them. (This is really true in any locale, but in the C
locale, the system doesn’t even provide a hint as to what encoding to use or expect.) The collation function just compares strings byte-wise; this works great on ASCII, and is meaningless (even for equality!) on UTF-8 or UTF-16. For functions, like case conversion, that need to know a character encoding, the C locale uses the rules for ASCII, which are completely and utterly wrong for most other encodings.
PostgreSQL also provides a C.UTF-8
locale, which is a special case on a special case. We’ll talk about that in a future installment.
Locale Provider
Locales are really just bundles of code and data, so they need to come from somewhere so the system can use them. The code libraries that contain locales are locale providers. There are two important ones (plus one other that’s relevant to PostgreSQL only).
(g)libc
libc
is the library that implements the standard C library (and a lot of other stuff) on POSIX systems. It’s where 99% of programs on POSIX systems get locale information from. Different distributions of Linux and *BSD systems have different sets of locales that are provided as part of the base system, and others than can be installed in optional packages. You’ll usually see libc
as glibc
, which is the Free Software Foundation’s GNU Project’s version of libc.
ICU
The International Components for Unicode is a set of libraries from the Unicode Consortium to provide utilties for handling Unicode. ICU includes a large number of functions, including a large set of locales. One of the most important tools provided by ICU is an implementation of the Unicode Collation Algorithm, which provides a basic collation algorithm usable across multple languages.
(Full disclosure: The work that ultimately became the ICU was started at Taligent, where I was the networking architect. It’s without a doubt the most useful thing Taligent ever produced.)
The PostgreSQL Built-In Provider
In version 17, PostgreSQL introduced a built-in locale provider. We’ll talk about that in a future installment.
What’s Next?
In the next installment, we’ll talk about how PostgreSQL uses all of this, and why it has resulted in some really terrible situations.
10 July 2024
10:48
I turned the last blog post into a talk; you can get the slides here.
25 March 2024
10:00
The PostgreSQL roles and privileges system can be full of surprises.
Let’s say we have a database test
, owned by user owner
. In it, we create a very secret function f
that we do not want just anyone to be able to execute:
test=> select current_user;
current_user
--------------
owner
(1 row)
test=> CREATE FUNCTION f() RETURNS int as $$ SELECT 1; $$ LANGUAGE sql;
CREATE FUNCTION
test=> select f();
f
---
1
(1 row)
There are two other users: hipriv
and lowpriv
. We want hipriv
to be able to run the function, but not lowpriv
. So, we grant EXECUTE
to hipriv
, but revoke it from lowpriv
:
test=> GRANT EXECUTE ON FUNCTION f() TO hipriv;
GRANT
test=> REVOKE EXECUTE ON FUNCTION f() FROM lowpriv;
REVOKE
Let’s test it! We log in as hipriv
and run the function:
test=> SELECT current_user;
current_user
--------------
hipriv
(1 row)
test=> SELECT f();
f
---
1
(1 row)
Works great. Now, let’s try it as lowpriv
:
test=> SELECT current_user;
current_user
--------------
lowpriv
(1 row)
test=> SELECT f();
f
---
1
(1 row)
Wait, what? Why did it let lowpriv
run f()
? We explicitly revoked that permission! Is the PostgreSQL privileges system totally broken?
Well, no. But there are some surprises.
Let’s look at the privileges on f()
:
test=> SELECT proacl FROM pg_proc where proname = 'f';
proacl
-----------------------------------------
{=X/owner,owner=X/owner,hipriv=X/owner}
(1 row)
The interpretation of each of the entries is “=/“. We see that owner
has X (that is, EXECUTE
) on f()
granted by itself, and hipriv
has EXECUTE
granted by owner
. But what’s with that first one that doesn’t have a role at the start? And where is our REVOKE
on lowpriv
?
The first thing that may be surprising is that there is no such thing as a REVOKE
entry in the privileges. REVOKE
removes a privilege that already exists; it doesn’t create a new entry that says “don’t allow this.” This means that unless there is already an entry that matches the REVOKE
, REVOKE
is a no-op.
The second thing is that if there is no role specified that, that means the special role PUBLIC
. PUBLIC
means “all roles.” So, anyone can execute f()
! This is the default privilege for new functions.
Combined, this means that when the function was created, EXECUTE
was granted to PUBLIC
. The REVOKE
was a no-op, because there was no explicit grant of privileges to lowpriv
.
How do we fix it? First, we can revoke that undesirable first grant to PUBLIC
:
test=> REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
REVOKE
hipriv
can still run the function, because we gave it an explicit grant:
test=> SELECT current_user;
current_user
--------------
hipriv
(1 row)
test=> SELECT f();
f
---
1
(1 row)
But lowpriv
can’t skate in under the grant to PUBLIC
, so it can’t run the function anymore:
test=> SELECT current_user;
current_user
--------------
lowpriv
(1 row)
test=> SELECT f();
ERROR: permission denied for function f
So, if you are counting on the PostgreSQL privilege system to prevent roles from running functions (and accessing other objects), be sure you know what the default permissions are, and adjust them accordingly.
4 October 2023
08:16
The slides for my talk “Look It Up: Real-Life Database Indexing” are now available.
9 May 2023
10:58
I shouldn’t have to say this, but don’t use ChatGPT for technical advice.
In an experiment, I asked 40 questions about PostgreSQL. 23 came back with misleading or simply inaccurate information. Of those, 9 came back with answers that would have caused (at best) performance issues. One of the answers could result in a corrupted database (deleting WAL files to recover disk space).
LLMs are not a replacement for expertise.
3 May 2023
09:55
Recently on one of the PostgreSQL mailing lists, someone wrote in asking if it was possible to get PostgreSQL to listen on two ports. The use case, to paraphrase, was that there was a heterogeneous mix of clients, some of which could connect with TLS, some of which couldn’t. They wanted the clients that could use TLS to do so, while allowing the non-TLS clients access.
The simple answer is: Upgrade your non-TLS clients already! But of course the world is a complicated place. And for reasons that weren’t given (but which we will accept for now), it has to be two different ports.
The PostgreSQL server itself can only listen on one port. But there were two options presented that could fix this:
- Run pgbouncer with TLS turned on, on a different port, and have it forward the connections to the PostgreSQL server via a local socket.
- Run stunnel to listen for TLS connections, and route those to PostgreSQL.
I don’t imagine many people will have this exact situation, but if you do… there are options!
2 May 2023
10:23
I’ll be speaking about Writing a Foreign Data Wrapper at PGCon 2023 in Ottawa, May 30-June 2, 2023. Do come! It’s the premiere technical/hacker conference for PostgreSQL.