postgresql when it's not your job

12:31

The Doom That Came To PostgreSQL: When Collations Change

15 November 2024

In this installment of our series on locales and PostgreSQL, we’ll talk about what can happen with the library that provides locales changes out from under a PostgreSQL database. It’s not pretty.

First, more about collations

Back in an earlier installment, we described a collation as a function that takes two strings, and returns if the strings are greater than, equal to, or less than each other:

coll(a, b) -> {greater than, equal, less than}

This is a slight oversimplification. What the actual collation function does is take a string, and returns a sort key:

coll(a) -> Sa

The sort keys have the property that if the strings they are derived from have (within the rules of the collation) a particular relationship, the sort keys do as well. This is very valuable, because it allows doing these comparisons repeatedly without having to rerun the (potentially expensive) collation function.

In the case of C locale, the collation function is the identity function: you compare exactly the same string of bytes that is present in the input. In more sophisticated locales, the collation function “normalizes” the input string into a string that can be repeatedly compared by just comparing the bytes.

What is an index, anyway?

Everyone who works with databases has an intuitive idea of what an index is, but for our purposes, an index is a cache of a partiuclar sortation of the input column. You can think of an index defined as:

sql CREATE INDEX on t(a) WHERE b = 1;

as a cache of the query predicate:

sql SELECT ... WHERE b = 1 ORDER BY a;

This allows the database engine to quickly find rows that use that predicate, as well as (often) retrieve a set of rows in sorted order without having to resort them.

The database automatically updates the “cache” when a row changes in a way that would alter the index. This generally works very, very well.

But:

Cache invalidation is hard.

Suppose that the collation function changes, but no one tells the database engine that. Now, the “cache” in the form of the index is invalid, since it holds the sort keys from the old collation function, but any new entries in the index will use the new sortation function.

This has happened.

Version 2.28 of glibc included many signficiant collation changes. There are many situations in which a PostgreSQL database whose indexes where built on an earlier version might find itself running on the newer version:

The result is suddenly, indexes aren’t working the way they should be: The results from queries on them that use text indexes can be wrong. This is particularly perplexing in the binary replica case, since queries on the system running the older version behave correctly, while ones on the newer versions are incorrect.

ICU has the same problem, although its collation functions change less often, so switching to ICU collations does not completely solve the problem.

Starting with version 10 (for ICU) and version 13 (for libc), PostgreSQL records the current version of the libraries being used as locale providers, and will issue a warning if the current version on the system does not match the version that was used previously:

WARNING: collation "xx-x-icu" has version mismatch DETAIL: The collation in the database was created using version 1.2.3.4, but the operating system provides version 2.3.4.5. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."xx-x-icu" REFRESH VERSION, or build PostgreSQL with the right library version.

This is a very good reason to monitor log output for warnings, since this is the only place that you might see this message.

It’s possible to get false positives here. A change to the library does not necessarily mean that the collation changed. However, it’s always a good idea to be cautious and rebuild objects that depend on changed collations. The PostgreSQL documentation provides instructions on how to do so. The PostgreSQL Wiki also has a good entry about this problem.

In the next installment, we’ll cover a subject near and dear to every database person’s heart: speed.

Comment