17:57

Farm to TABLE: Local(e) Providers

14 November 2024

In our this installment about character encodings, locations, and locales in PostgreSQL, we’re talking about locale providers.

A reminder about locales

A “locale” is a bundled combination of data structures and code that provides support services for handling different localization services. For our purposes, the two most important things that a locale provides are:

A character encoding.
A collation.

And as a quick reminder, a character encoding is a mapping between a sequence of one or more bytes and a “glyph” (what we generally think of as a charcter, like A or * or Å or 🤷). A collation is a function that places strings of characters into a total ordering.

UTF-8

Everything here assumes you are using UTF-8 encoding in your database. If you are not, please review why you should.

Getting locale

A locale provider is a library (in the computing sense, a hunk of code and data) whose API provides a set of locales and lets you use the functions in them. There are three interesting locale providers in the PostgreSQL world:

libc — This is the implementation of the C standard library on every POSIX-compatible system, and many others. You’ll most often see glibc, which is the GNU project’s implemenatation of libc.
ICU — The Inernational Components for Unicode. This is a (very) extensive library for working with Unicode text. Among other features, it has a comprehensive locale system.
builtin — As of version 17, PostgreSQL has a builtin collation provider, based on locales built into PostgreSQL itself.

libc

When nearly everyone thinks of locales, they think of libc. Really, the mostly think of the ubiquitous en_US.UTF-8 locale. If you do not specify a locale when creating a new PostgreSQL instance using initdb, it uses the system locale, which very often en_US.UTF-8. (Or C locale, which is almost never the right choice.)

The structure of a glibc locale name is (using en_US.UTF-8 as an example):

en — The base language of the locale, usually the ISO 639 code.
US — Is the “region.” The “region” is used if there are subtypes of locales within the language. US means US (“American”) English.
UTF-8 — Is the character encoding.
There’s also a (rarely seen in practice) optional “modifier” that can be postpended, staring with an @. For example, @euro means to use the Euro currency symbol. (Not all modifiers apply to all locales.)

The exact set of locales available on a particular system depend on the particular flavor of *NIX system, the distribution, and what packages have been installed. The command locale -a will list what’s currently installed, and locale will show the various settings of the current locale.

Different platforms can have different collations for the same locale when using libc, and the collation rules for a particular locale can change between libc releases. (We’ll talk about what happens here in a future installment.)

ICU

ICU is an extremely powerful system for dealing with text in a wide variety of languages, and has a head-spinning number of options and configurations. The set of possible ICU collations is enormous, and this just scratches the surface.

In PostgreSQL, ICU locales can be used by creating one in the database using the CREATE COLLATION command. An example is:

CREATE COLLATION english (provider = icu, locale = 'en-US');

The basic structure of ICU collation names is simliar to that of libc collations: a language code, followed by an optional region (in ICU vernacular, these two components are a language tag). In addition to the language tag, locale names can take key/value pairs of options, introduced by a -u- extension in the locale name. An example from the PostgreSQL documentation is:

CREATE COLLATION mycollation5 (provider = icu, deterministic = false, locale = 'en-US-u-kn-ks-level2');

Taking the components of the name piece by piece:

en — English.
US — United States English.
u — Introduces Unicode Consortium-specific options.
kn — Sort numbers as a single numeric value rather than as a sequence of digit characters. ‘123’ will sort after 45, for example. There is no “value” part, as kn is a true/false flag.
ks-level2 — Sets the “sensitivity” of the collation to level2. You can think of the “level” as how much the collation “cares” about differences between characters. level2 means that the collation doesn’t “care” about upper-case vs lower case characters, and will thus do case-insensitive sortation.

ICU also defines a language code of und meaning (“undefined”). To distinguish libc-style collations from ICU-style locations when only a language tag is given, -x-icu is added to the end of the name.

ICU is a library like libc, and like libc, the collation algorithms can change between releases (although, in practice, this happens less frequently than it does with libc).

builtin

As of PostgreSQL 17, PostgreSQL has a built-in locale provider, using the idiosyncratic name of builtin. The builtin provider, right now, defines exactly two locales:

C — This is identical to the C locale provided by libc, but is implemented within the PostgreSQL core.
C.UTF-8 — This sorts based on the numeric Unicode codepoint rather than just treating the text as a sequence of bytes (which ignores encodings). This produces results that, while often not correct for any language except US English (pretty much), are deterministic, and performs very well. Needless to say, this is only available for databases in UTF-8 encoding.

So, why is there a builtin provider at all? Doesn’t libc and ICU cover the space completely. Well, yes, with one very important exception, which we’ll talk about in our next installment.

There are no comments yet.

Comments are closed.