postgresql when it's not your job

00:00

Locale Cooking: Common Scenarios and Suggestions

22 November 2024

We’ve gone through a lot of detail about locales and collations here, but what should you do when it is time to set up a database? Here is a cookbook with some common scenarios, with recommendations.

“I want maximum speed, I am running on PostgreSQL version 17 or higher, and it’s OK if collation is whacky for non-7-bit-ASCII characters.”

Use the C.UTF-8 local from the built-in locale provider.

CREATE DATABASE mydb locale_provider=builtin builtin_locale='C.UTF8' template=template0;

The built-in C.UTF-8 locale collates text based on the Unicode codepoints. You won’t get broken or invalid characters (as you would with the C/POSIX locale), but sortation will be incorrect for some languages. However, this gets you a good combination of reliable (if sometimes whacky) sort order, and very good performance. You also avoid any issues with underlying locale provider libraries changing.

“I want maximum speed, I am running on a version of PostgreSQL before 17, and I swear that I will never, ever put a character in the DB that is not 7-bit ASCII.”

Then you have my permission to use the C/POSIX locale.

CREATE DATABASE mydb locale='POSIX' template=template0;

C/POSIX locale just uses the C standard library function strcmp, so it does byte by byte comparisons, ignoring encoding entirely. This also avoids issues with locale provider libraries changing. If you do put in non-7-bit-ASCII characters, the sortation will be completely whacky, so don’t do that.

“I’m willing to trade off a little bit of speed so that my collations are reasonable across languages, or I am on a version before 17 and do want to use non-7-bit-ASCII characters.”

Use the ICU und-x-icu locale.

CREATE DATABASE mydb locale_provider=icu icu_locale='und-x-icu' encoding='UTF8' template=template0;

This provides reasonable collation order for most languages, and is (usually) fast compared to libc locales (except POSIX). The icu libraries also can change in a breaking fashion less often than libc, although either one can change in an unfortunate way.

“I need my collation to be spot-on correct for one particular language, but that will be the only language in the database.”

Use either the libc or icu locale specific to your language requirements. Which one you use depends on which one will suit your needs better. Which one is faster depends on the specific collation, and icu has more features for getting very specific with the collation than libc does. If it is a tossup, icu libraries tend to have fewer breaking changes than libc libraries over time.

An example would be:

CREATE DATABASE mydb locale_provider=icu icu_locale='de-AT' encoding='UTF8' template=template0;

or

CREATE DATABASE mydb locale_provider=libc locale='de_AT.UTF-8' template=template0;

“My database will have different languages in it, and I’d like collation to be spot-on correct for those languages.”

Create the database as und-x-icu, and then put each language in its own table, or column within the table, specifying the appropriate collation for each.

“` CREATE DATABASE mydb localeprovider=icu iculocale=’und-x-icu’ encoding=’UTF8′ template=template0;

CREATE COLLATION de_at (provider=icu, locale=’de-AT’);

CREATE TABLE stuff (deat text collate deat); “`

Doing useful multi-language full-text search is (way) beyond the scope here.

“I am on PostgreSQL 17 or higher, and I never, ever want to worry about my locale provider library changing on me.”

Use the C.UTF-8 locale from the built-in locale provider.

CREATE DATABASE mydb locale_provider=builtin builtin_locale='C.UTF8' template=template0;

You will get slightly whacky collation for non-7-bit-ASCII characters, but you’ll never have to worry about the locale provider changing in a bad way.

“I am on a version of PostgreSQL earlier than 17, and I never, ever want to worry about my locale provider library changing on me.”

Then you are stuck using C/POSIX locale.

CREATE DATABASE mydb locale='POSIX' template=template0;

The good news is that if strcmp changes in some breaking way, we’ll all have a lot more to worry about than PostgreSQL collation (like Linux not working anymore). Remember that if you have non-7-bit-ASCII characters, they’ll collate in some unfortunate way.

“I’m using either libc or icu, and I am terrified that they will change underneath me and break things.”

The solution here is to:

  1. Upgrade to PostgreSQL 17.
  2. Never, ever run a primary and a binary replica on two different versions of the OS.
  3. When something underlying PostgreSQL changes (new OS, etc.), be sure to respond at once to the WARNING that is generated when a locale provider library changes by doing a cluster-wide REINDEX DATABASE.

“Do I really have to worry about this?”

Sorry, yes, you do. The good news is that you generally only have to worry about it when first setting up the database, or when moving a database to a new OS (or upgrading the OS or packages on an existing server).

And that concludes our look at locales in PostgreSQL! Thanks for reading.

Comments are closed.