postgresql when it's not your job

17:57

Farm to TABLE: Local(e) Providers

14 November 2024

In our this installment about character encodings, locations, and locales in PostgreSQL, we’re talking about locale providers.

A reminder about locales

A “locale” is a bundled combination of data structures and code that provides support services for handling different localization services. For our purposes, the two most important things that a locale provides are:

And as a quick reminder, a character encoding is a mapping between a sequence of one or more bytes and a “glyph” (what we generally think of as a charcter, like A or * or Γ… or 🀷). A collation is a function that places strings of characters into a total ordering.

UTF-8

Everything here assumes you are using UTF-8 encoding in your database. If you are not, please review why you should.

Getting locale

A locale provider is a library (in the computing sense, a hunk of code and data) whose API provides a set of locales and lets you use the functions in them. There are three interesting locale providers in the PostgreSQL world:

libc

When nearly everyone thinks of locales, they think of libc. Really, the mostly think of the ubiquitous en_US.UTF-8 locale. If you do not specify a locale when creating a new PostgreSQL instance using initdb, it uses the system locale, which very often en_US.UTF-8. (Or C locale, which is almost never the right choice.)

The structure of a glibc locale name is (using en_US.UTF-8 as an example):

The exact set of locales available on a particular system depend on the particular flavor of *NIX system, the distribution, and what packages have been installed. The command locale -a will list what’s currently installed, and locale will show the various settings of the current locale.

Different platforms can have different collations for the same locale when using libc, and the collation rules for a particular locale can change between libc releases. (We’ll talk about what happens here in a future installment.)

ICU

ICU is an extremely powerful system for dealing with text in a wide variety of languages, and has a head-spinning number of options and configurations. The set of possible ICU collations is enormous, and this just scratches the surface.

In PostgreSQL, ICU locales can be used by creating one in the database using the CREATE COLLATION command. An example is:

CREATE COLLATION english (provider = icu, locale = 'en-US');

The basic structure of ICU collation names is simliar to that of libc collations: a language code, followed by an optional region (in ICU vernacular, these two components are a language tag). In addition to the language tag, locale names can take key/value pairs of options, introduced by a -u- extension in the locale name. An example from the PostgreSQL documentation is:

CREATE COLLATION mycollation5 (provider = icu, deterministic = false, locale = 'en-US-u-kn-ks-level2');

Taking the components of the name piece by piece:

ICU also defines a language code of und meaning (“undefined”). To distinguish libc-style collations from ICU-style locations when only a language tag is given, -x-icu is added to the end of the name.

ICU is a library like libc, and like libc, the collation algorithms can change between releases (although, in practice, this happens less frequently than it does with libc).

builtin

As of PostgreSQL 17, PostgreSQL has a built-in locale provider, using the idiosyncratic name of builtin. The builtin provider, right now, defines exactly two locales:

So, why is there a builtin provider at all? Doesn’t libc and ICU cover the space completely. Well, yes, with one very important exception, which we’ll talk about in our next installment.

Comment