postgresql when it's not your job

15 November 2024

12:31

The Doom That Came To PostgreSQL: When Collations Change

In this installment of our series on locales and PostgreSQL, we’ll talk about what can happen with the library that provides locales changes out from under a PostgreSQL database. It’s not pretty.

First, more about collations

Back in an earlier installment, we described a collation as a function that takes two strings, and returns if the strings are greater than, equal to, or less than each other:

coll(a, b) -> {greater than, equal, less than}

This is a slight oversimplification. What the actual collation function does is take a string, and returns a sort key:

coll(a) -> Sa

The sort keys have the property that if the strings they are derived from have (within the rules of the collation) a particular relationship, the sort keys do as well. This is very valuable, because it allows doing these comparisons repeatedly without having to rerun the (potentially expensive) collation function.

In the case of C locale, the collation function is the identity function: you compare exactly the same string of bytes that is present in the input. In more sophisticated locales, the collation function “normalizes” the input string into a string that can be repeatedly compared by just comparing the bytes.

What is an index, anyway?

Everyone who works with databases has an intuitive idea of what an index is, but for our purposes, an index is a cache of a partiuclar sortation of the input column. You can think of an index defined as:

sql CREATE INDEX on t(a) WHERE b = 1;

as a cache of the query predicate:

sql SELECT ... WHERE b = 1 ORDER BY a;

This allows the database engine to quickly find rows that use that predicate, as well as (often) retrieve a set of rows in sorted order without having to resort them.

The database automatically updates the “cache” when a row changes in a way that would alter the index. This generally works very, very well.

But:

Cache invalidation is hard.

Suppose that the collation function changes, but no one tells the database engine that. Now, the “cache” in the form of the index is invalid, since it holds the sort keys from the old collation function, but any new entries in the index will use the new sortation function.

This has happened.

Version 2.28 of glibc included many signficiant collation changes. There are many situations in which a PostgreSQL database whose indexes where built on an earlier version might find itself running on the newer version:

The result is suddenly, indexes aren’t working the way they should be: The results from queries on them that use text indexes can be wrong. This is particularly perplexing in the binary replica case, since queries on the system running the older version behave correctly, while ones on the newer versions are incorrect.

ICU has the same problem, although its collation functions change less often, so switching to ICU collations does not completely solve the problem.

Starting with version 10 (for ICU) and version 13 (for libc), PostgreSQL records the current version of the libraries being used as locale providers, and will issue a warning if the current version on the system does not match the version that was used previously:

WARNING: collation "xx-x-icu" has version mismatch DETAIL: The collation in the database was created using version 1.2.3.4, but the operating system provides version 2.3.4.5. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION pg_catalog."xx-x-icu" REFRESH VERSION, or build PostgreSQL with the right library version.

This is a very good reason to monitor log output for warnings, since this is the only place that you might see this message.

It’s possible to get false positives here. A change to the library does not necessarily mean that the collation changed. However, it’s always a good idea to be cautious and rebuild objects that depend on changed collations. The PostgreSQL documentation provides instructions on how to do so. The PostgreSQL Wiki also has a good entry about this problem.

In the next installment, we’ll cover a subject near and dear to every database person’s heart: speed.

14 November 2024

17:57

Farm to TABLE: Local(e) Providers

In our this installment about character encodings, locations, and locales in PostgreSQL, we’re talking about locale providers.

A reminder about locales

A “locale” is a bundled combination of data structures and code that provides support services for handling different localization services. For our purposes, the two most important things that a locale provides are:

And as a quick reminder, a character encoding is a mapping between a sequence of one or more bytes and a “glyph” (what we generally think of as a charcter, like A or * or Γ… or 🀷). A collation is a function that places strings of characters into a total ordering.

UTF-8

Everything here assumes you are using UTF-8 encoding in your database. If you are not, please review why you should.

Getting locale

A locale provider is a library (in the computing sense, a hunk of code and data) whose API provides a set of locales and lets you use the functions in them. There are three interesting locale providers in the PostgreSQL world:

libc

When nearly everyone thinks of locales, they think of libc. Really, the mostly think of the ubiquitous en_US.UTF-8 locale. If you do not specify a locale when creating a new PostgreSQL instance using initdb, it uses the system locale, which very often en_US.UTF-8. (Or C locale, which is almost never the right choice.)

The structure of a glibc locale name is (using en_US.UTF-8 as an example):

The exact set of locales available on a particular system depend on the particular flavor of *NIX system, the distribution, and what packages have been installed. The command locale -a will list what’s currently installed, and locale will show the various settings of the current locale.

Different platforms can have different collations for the same locale when using libc, and the collation rules for a particular locale can change between libc releases. (We’ll talk about what happens here in a future installment.)

ICU

ICU is an extremely powerful system for dealing with text in a wide variety of languages, and has a head-spinning number of options and configurations. The set of possible ICU collations is enormous, and this just scratches the surface.

In PostgreSQL, ICU locales can be used by creating one in the database using the CREATE COLLATION command. An example is:

CREATE COLLATION english (provider = icu, locale = 'en-US');

The basic structure of ICU collation names is simliar to that of libc collations: a language code, followed by an optional region (in ICU vernacular, these two components are a language tag). In addition to the language tag, locale names can take key/value pairs of options, introduced by a -u- extension in the locale name. An example from the PostgreSQL documentation is:

CREATE COLLATION mycollation5 (provider = icu, deterministic = false, locale = 'en-US-u-kn-ks-level2');

Taking the components of the name piece by piece:

ICU also defines a language code of und meaning (“undefined”). To distinguish libc-style collations from ICU-style locations when only a language tag is given, -x-icu is added to the end of the name.

ICU is a library like libc, and like libc, the collation algorithms can change between releases (although, in practice, this happens less frequently than it does with libc).

builtin

As of PostgreSQL 17, PostgreSQL has a built-in locale provider, using the idiosyncratic name of builtin. The builtin provider, right now, defines exactly two locales:

So, why is there a builtin provider at all? Doesn’t libc and ICU cover the space completely. Well, yes, with one very important exception, which we’ll talk about in our next installment.

27 October 2024

03:18

Speaking in Tongues: PostgreSQL and Character Encodings

This is the second installment in our discussion of locales, character encodings, and collations in PostgreSQL. In this installment, we’ll talk about character encodings as they relate to PostgreSQL.

A quick reminder!

A character encoding is a mapping between code points (that is, numbers) and glyphs (what us programmers usually call characters). There are lots, and lots, and lots of different character encodings, most of them some superset of good old 7-bit ASCII.

PostgreSQL and Encodings

From the point of view of the computer, a character string is just a sequence of bytes (maybe terminated by a zero byte, maybe with a length). If the only thing PostgreSQL had to do with character strings was store them and return them to the client, it could just ignore that character encodings even exist.

However, databases don’t just store character strings: they also compare them, build indexes on them, change them to upper case, do regex searches on them, and other things that mean they need know what the characters are and how to manipulate them. So, PostgreSQL needs to know what character encoding applies to the text it is storing.

You specify the character encoding for a database when creating the database. (Each database in a PostgreSQL instance can have a different encoding.) That encoding applies to all text stored in the database, and can’t be changed once the database is created: your only option to change encodings is to use pg_dump (or a similar copy-based system, like logical replication) and transfer the data to a new database with the new encoding. This makes the choice of what encoding to pick up front very important.

This is stupid and I don’t want to think about it.

If you search the web, you’ll see a lot of advice that you should just “ignore locales” in PostgreSQL and just use “C” locale. This advice usually dates from around when PostgreSQL first introduced proper locale support, and there was a certain amount of culture shock in having to deal with these issues.

This is terrible advice. If you are processing human-readable text, you need to think about these topics. Buckle up.

The good news is that there is a correct answer to the question, “What encoding should I use?”

Use UTF-8

The character encoding decision is an easy one: Use UTF-8. Although one can argue endlessly over whether or not Unicode is the “perfect” way of encoding characters, we almost certainly will not get a better one in our lifetime, and it has become the de facto (and in many cases, such as JSON, the de jure) standard for text on modern systems.

The only exception is that if you are building a highly specialized database that will only ever accept text in a particular encoding. The chance of this is extremely small, an even then, there is a strong argument for converting the text to UTF-8 instead.

But what about C encoding?

There is a downside to using UTF-8: It’s slower than basic “C” encoding. (“C” encoding, as a reminder, just takes the binary strings as it finds them and compares them that way, without regard to what natural language code points they represent.) The difference is significant: on a quick test of sorting 1,000,000 strings, each 64 characters and ASCII-only, UTF-8 encoding with the en_UTF8 collation was almost 18x times slower than C encoding and collation.

There are, however, some significant problems:

Unless sorting a large number of character strings on a regular basis is a bottleneck, or you know that you will never have strings that correspond to natural language, the performance improvement is not going to be worth the inevitable issues. Just use UTF-8 coding.

(There are some reasonable alternatives we will discuss later, with their own set of trade-offs.)

Coming up next…

Having decided that we are going to use UTF-8 as the character encoding, which collation should we use? PostgreSQL has an embarrassingly large number of options here, and version 17 introduced some new ones!

25 October 2024

03:50

“Gentlemen, this is a 🏈”: Glyphs, Encodings, Collations, and Locales

If you are not familiar with the quote.

This is part one of a series on PostgreSQL and collations, and how to use them without tears. This is an introduction to the general concepts of glyphs, character encodings, collations, and locales.

Glyphs

There is (as always in things involving real human behavior) some controversy over what is a gylph and what isn’t, but as a first approximation, we can use this definition:

These are all glyphs:

A z 4 ΓΌ

Symbols are also glyphs:

^ @ ! ‰

Some things are both:

∏

Capital Ο€ is both a mathematic symbol and a letter of the Greek alphabet. Of course, most Greek letters are also mathematical symbols.

Glyphs also include characters from other languages that are not alphabetic:

ι€™δΊ›ιƒ½ζ˜―ηΉι«”δΈ­ζ–‡ηš„ε­—ε½’
αŽ£αα“ α‘αŽΎαŽ΄αŽ’

And glyphs include symbols that aren’t traditional language components at all:

🏈 πŸ“ πŸ”  πŸ›‘

Strictly speaking, there is a distinction between a “glyph” and a “grapheme.” The “glyph” is the mark itself, which can be different depending on typeface, weight, etc. The “grapheme” is the underlying functional unit that the glyph represents. In this definition, these are all the same “grapheme” (lower-case Latin a) but different glyphs:

A series of the lower-case letter a in different fonts

However, everything involving characters in computing calls the fundamental units “glyphs” rather than “graphemes,” so that’s the term you’ll see here.

A glyph corresponds to what most programmers intuitively think of as a character, and the terms are more or less interchangeable, with some exceptions that we’ll talk about in future articles.

Character Encodings

We need computers to be able to understand glyphs. That means we need to turn glyphs into numbers. (A number that represents a glyph is that glyph’s code point.)

For that, we have a character encoding. A character encoding is a bidrectional mapping between a set of glyphs and a set of numbers. For programmers, probably the most familiar character encoding is ASCII, which maps a (very) limited set of 95 glyphs to 7-bit numbers. (There are encodings even older than ASCII.) Some greatest ASCII hits include:

    A ↔ 65 (0x41)
    a ↔ 97 (0x61)
    * ↔ 42 (0x2A)

ASCII has been around for a long time (it achieved its modern form in 1967). It has some nice features: if an ASCII code represents an upper-case letter, it can be turned into the matching lower-case letter with code + 0x20, and back to upper-case with code - 0x20. If an ASCII code is a numeric glyph, it can be turned into the actual numeric equivalent with code - 0x30.

(You’ll sometimes hear the phrase “7-bit ASCII.” This is, strictly speaking, redundant. ASCII doesn’t define any code points above 127. Everything that uses the range 128-255 is an extension to ASCII. “7-bit ASCII” is useful, though, when you want to specify “ASCII with no extensions.”)

ASCII has huge limitations: It was designed to represent the characters that appeared on a typical English-language computer terminal in 1967, which means it lacks glyphs for the vast majority of languages in the world (even English! Γ―, for example, is very commonly used in English, as is Γ©).

It became instantly obvious that ASCII wasn’t going to cut it as a worldwide character encoding. Diffferent languages started developing their own character encodings, some based on ASCII codes greater than 127, some on newly-invented character encodings that took more than a single 8-bit byte to represent. Chaos reigned.

Unicode

In an attempt to bring some kind of order to character encodings, an effort began in the late 1980s to create a universal encoding that would include all glyphs from all languages. This was just as ambitious as it sounds! In 1991, the Unicode Consortium was formed, and the first version of Unicode was published in October 1991.

Unicode code points are 32 bit numbers, organized into “code planes.” The “Basic Multilingual Plane,” which contains most glyphs for most living languages, has the top 16 bits equal to zero, but Unicode code points can be, and often are in real life, greater than 65,535.

So, is Unicode a “character encoding”? Well, yes and no. It maps glyphs to code points and back again, so it qualifies as a character encoding in that sense. But Unicode deliberately does not specify how the particular code points are to be stored inside the computer.

To find out how the code points are actually stored inside the computer, we need to talk about Unicode Transformation Formats.

UTF-8, UTF-16, etc.

A Unicode Transformation Format (UTF) is a way of encoding a 32-bit Unicode code point. (Yes, we’re encoding an encoding. Computers are great!)

The simplest UTF is UTF-32: we just take four bytes for each Unicode code point, and that’s that. It’s simple and doesn’t require anything special to decode (well, almost), but it means we are taking four bytes for every glyph. Given that a large percentage of the time, computers are dealing with ASCII characters, we’ve just made our character strings four times larger. No one wants that.

So, UTF-8 and UTF-16 were invented, and were one of the cleverest inventions ever. The rules for both are:

One of the very nice features of Unicode is that any code point where the upper 25 bits are zero is exactly the same as the ASCII code point for the same glyph, so text encoded in ASCII is also encoded in UTF-8 with no other processing.

UTF-8 is the overwhelming favorite for encoding Unicode.

One downside of UTF-8 is that a “character” in the traditional sense is no longer fixed length, so you can’t just count bytes to tell how many characters are in a string. Programming languages have struggled with this for years, although the situation seems to be finally calming down.

In order to process UTF-8 property, the code must know that it’s getting UTF-8. If you have ever seen ugly things like β€šΓ„ΓΉ on a web page where a β€œ should be, you’ve seen what happens when UTF-8 isn’t interpreted properly.

Collations

Computers have to compare character strings. From this really simple requirement, no end of pain has flowed.

A collation is just a function that takes two strings (that is, ordered sequences of glyphs represented as code points), and says whether or not they are less than, equal to, or greater than each other:

​ f(string1, string2) β†’ (<, =, >)

The problem, as always, is human beings. Different languages have different rules for what order strings should appear in, even if exactly the same glyphs are used. Although there have been some heroic attempts to define collation rules that span languages, languages and they way they are written tend to be very emotional issues tied up with feelings of culture and national pride (β€œa language is a dialect with an army and a navy”).

There are a lot of collations. Collations by their nature are associated with a particular character encoding, but one character encoding can have many different collations. For example, just on my laptop, UTF-8 has 53 different collations.

Locale

On POSIX-compatible systems (which include Linux and *BSD), collations are one part of a bundle called a locale. A locale includes several different utility functions; they typically are:

Locales on POSIX systems have names that look like:

​ fr_BE.UTF-8

fr defines the langauge (French), BE defines the “territory” (Belgium, so we are talking about Belgian French), and UTF-8 is the character encoding. From this, we can determine that the character strings are encoded as UTF-8, and the collation rules are the rules for Belgian French.

If there’s only one character encoding available for a particular combination of language and territory, you’ll sometimes see the locale written without the encoding, like fr_BE.

There is one locale (with two different names) that is a big exception to the rules, and an important one:

C or POSIX

The C locale (also called POSIX) uses rules defined by the C/C++ language standard. This means that the system doesn’t have an opinion about what encoding the strings are in, and it’s up to the programmer to keep track of them. (This is really true in any locale, but in the C locale, the system doesn’t even provide a hint as to what encoding to use or expect.) The collation function just compares strings byte-wise; this works great on ASCII, and is meaningless (even for equality!) on UTF-8 or UTF-16. For functions, like case conversion, that need to know a character encoding, the C locale uses the rules for ASCII, which are completely and utterly wrong for most other encodings.

PostgreSQL also provides a C.UTF-8 locale, which is a special case on a special case. We’ll talk about that in a future installment.

Locale Provider

Locales are really just bundles of code and data, so they need to come from somewhere so the system can use them. The code libraries that contain locales are locale providers. There are two important ones (plus one other that’s relevant to PostgreSQL only).

(g)libc

libc is the library that implements the standard C library (and a lot of other stuff) on POSIX systems. It’s where 99% of programs on POSIX systems get locale information from. Different distributions of Linux and *BSD systems have different sets of locales that are provided as part of the base system, and others than can be installed in optional packages. You’ll usually see libc as glibc, which is the Free Software Foundation’s GNU Project’s version of libc.

ICU

The International Components for Unicode is a set of libraries from the Unicode Consortium to provide utilties for handling Unicode. ICU includes a large number of functions, including a large set of locales. One of the most important tools provided by ICU is an implementation of the Unicode Collation Algorithm, which provides a basic collation algorithm usable across multple languages.

(Full disclosure: The work that ultimately became the ICU was started at Taligent, where I was the networking architect. It’s without a doubt the most useful thing Taligent ever produced.)

The PostgreSQL Built-In Provider

In version 17, PostgreSQL introduced a built-in locale provider. We’ll talk about that in a future installment.

What’s Next?

In the next installment, we’ll talk about how PostgreSQL uses all of this, and why it has resulted in some really terrible situations.

10 July 2024

10:48

Checking Your Privileges, 2

I turned the last blog post into a talk; you can get the slides here.

25 March 2024

10:00

Checking Your Privileges

The PostgreSQL roles and privileges system can be full of surprises.

Let’s say we have a database test, owned by user owner. In it, we create a very secret function f that we do not want just anyone to be able to execute:

test=> select current_user;
 current_user 
--------------
 owner
(1 row)

test=> CREATE FUNCTION f() RETURNS int as $$ SELECT 1; $$ LANGUAGE sql;
CREATE FUNCTION
test=> select f();
 f 
---
 1
(1 row)

There are two other users: hipriv and lowpriv. We want hipriv to be able to run the function, but not lowpriv. So, we grant EXECUTE to hipriv, but revoke it from lowpriv:

test=> GRANT EXECUTE ON FUNCTION f() TO hipriv;
GRANT
test=> REVOKE EXECUTE ON FUNCTION f() FROM lowpriv;
REVOKE

Let’s test it! We log in as hipriv and run the function:

test=> SELECT current_user;
 current_user 
--------------
 hipriv
(1 row)

test=> SELECT f();
 f 
---
 1
(1 row)

Works great. Now, let’s try it as lowpriv:

test=> SELECT current_user;
 current_user 
--------------
 lowpriv
(1 row)

test=> SELECT f();
 f 
---
 1
(1 row)

Wait, what? Why did it let lowpriv run f()? We explicitly revoked that permission! Is the PostgreSQL privileges system totally broken?

Well, no. But there are some surprises.

Let’s look at the privileges on f():

test=> SELECT proacl FROM pg_proc where proname = 'f';
                 proacl                  
-----------------------------------------
 {=X/owner,owner=X/owner,hipriv=X/owner}
(1 row)

The interpretation of each of the entries is “=/“. We see that owner has X (that is, EXECUTE) on f() granted by itself, and hipriv has EXECUTE granted by owner. But what’s with that first one that doesn’t have a role at the start? And where is our REVOKE on lowpriv?

The first thing that may be surprising is that there is no such thing as a REVOKE entry in the privileges. REVOKE removes a privilege that already exists; it doesn’t create a new entry that says “don’t allow this.” This means that unless there is already an entry that matches the REVOKE, REVOKE is a no-op.

The second thing is that if there is no role specified that, that means the special role PUBLIC. PUBLIC means “all roles.” So, anyone can execute f()! This is the default privilege for new functions.

Combined, this means that when the function was created, EXECUTE was granted to PUBLIC. The REVOKE was a no-op, because there was no explicit grant of privileges to lowpriv.

How do we fix it? First, we can revoke that undesirable first grant to PUBLIC:

test=> REVOKE EXECUTE ON FUNCTION f() FROM PUBLIC;
REVOKE

hipriv can still run the function, because we gave it an explicit grant:

test=> SELECT current_user;
 current_user 
--------------
 hipriv
(1 row)

test=> SELECT f();
 f 
---
 1
(1 row)

But lowpriv can’t skate in under the grant to PUBLIC, so it can’t run the function anymore:

test=> SELECT current_user;
 current_user 
--------------
 lowpriv
(1 row)

test=> SELECT f();
ERROR:  permission denied for function f

So, if you are counting on the PostgreSQL privilege system to prevent roles from running functions (and accessing other objects), be sure you know what the default permissions are, and adjust them accordingly.

4 October 2023

08:16

“Look It Up: Real-Life Database Indexing” at PgConf.NYC

The slides for my talk “Look It Up: Real-Life Database Indexing” are now available.

9 May 2023

10:58

Don’t use ChatGPT to solve problems.

I shouldn’t have to say this, but don’t use ChatGPT for technical advice.

In an experiment, I asked 40 questions about PostgreSQL. 23 came back with misleading or simply inaccurate information. Of those, 9 came back with answers that would have caused (at best) performance issues. One of the answers could result in a corrupted database (deleting WAL files to recover disk space).

LLMs are not a replacement for expertise.

3 May 2023

09:55

Running PostgreSQL on two ports

Recently on one of the PostgreSQL mailing lists, someone wrote in asking if it was possible to get PostgreSQL to listen on two ports. The use case, to paraphrase, was that there was a heterogeneous mix of clients, some of which could connect with TLS, some of which couldn’t. They wanted the clients that could use TLS to do so, while allowing the non-TLS clients access.

The simple answer is: Upgrade your non-TLS clients already! But of course the world is a complicated place. And for reasons that weren’t given (but which we will accept for now), it has to be two different ports.

The PostgreSQL server itself can only listen on one port. But there were two options presented that could fix this:

  1. Run pgbouncer with TLS turned on, on a different port, and have it forward the connections to the PostgreSQL server via a local socket.
  2. Run stunnel to listen for TLS connections, and route those to PostgreSQL.

I don’t imagine many people will have this exact situation, but if you do… there are options!

2 May 2023

10:23

“Writing a Foreign Data Wrapper” at PGCon 2023

I’ll be speaking about Writing a Foreign Data Wrapper at PGCon 2023 in Ottawa, May 30-June 2, 2023. Do come! It’s the premiere technical/hacker conference for PostgreSQL.

« Older Entries