22:10
Talk Fast: The Speed of Various Collations
20 November 2024
In previous installments in this series about locales and collations in PostgreSQL, we’ve made some vague allusions to the speed of the various collation functions. Let’s be a bit more analytical now.
The data here was gathered on a 4GB Linode instance running Ubuntu 24.04 and PostgreSQL 17.1. The test data was 1,000,000 records, each one a string of 64 random 7-bit ASCII characters. For each of the configurations, the test data was loaded into a table:
sql
CREATE TABLE t(v varchar);
And then a simple sort on the data was timed (discarding the first run to control for cold-cache effects, and averaging over five others):
sql
SELECT * FROM t ORDER BY v;
work_mem
was set high enough that the sort would be executed in memory, and max_parallel_workers_per_gather
was set to 0 to avoid different query plans obscuring the underlying library performance. This is hardly an exhaustive test of performance, but it gives a reasonable idea of the relationship between the various collation methods.
| Locale Provider | Locale | Time | | --------------- | --------- | ------ | | builtin | C.UTF8 | 486ms | | libc | POSIX | 470ms | | icu | und-x-icu | 765ms | | icu | de-AT | 770ms | | libc | de_AT | 3645ms |
Unsurprisingly, the POSIX
locale was the fastest. However, the built-in C.UTF8
collation was nearly as fast (really, wiithin the margin of error), and produces reasonable and consistent results, if not exactly correct for many languages.
The other two icu
tests used the und-x-icu
locale (which provides a reasonable sort order for most languages) and a language-specific collation (German, Austrian region). Their performance was slower than C.UTF8
or POSIX
, but it’s unlikely that by itself will cause a significant application performance issue, unless the application spends a lot of time sorting strings.
The libc
equivalent to the de-AT
icu locale took much longer than icu
to complete the sort: almost 5x longer. There’s probably not a good reason to use the language-specific libc
locales over the icu
locales, especially with a performance different that substantial. icu
is also more consistent platform-to-platform than libc
.
So, knowing all of this, what should you do? In our last installment, we’ll discuss some different scenarios and what locales to use for them.
There are no comments yet.
Comment