The Build


PostgreSQL Performance When It’s Not Your Job

24 January 2012

My presentation from SCALE 10x, “PostgreSQL Performance When It’s Not Your Job” is now available for download.

Tomas at 03:55, 25 January 2012:

Very nice. Just a few minor corrections and opinions:

slide 35: It can actually take up to 2x the amount of “checkpoint_segments * 16MB” due to checkpoint_completion_target (you already have checkpoint_segments worth of files, the DB plans to write them out over the next few minutes, and meanwhile activity produces another checkpoint_segments, so you end up with twice the number).

slide 39: The checkpoint settings do influence the number of changes that need to be written. Significantly. If a buffer is modified repeatedly and checkpoints happen frequently, it will be written repeatedly. Imagine a process that runs for one hour and inserts a new row into a particular block every minute. If the checkpoints happen every minute, the block will be written each time – i.e. 60x. If the checkpoints happen less frequently (because the timeout / number of segments are higher), the inserts will be consolidated into fewer writes. If there are only 6 checkpoints in the hour, the buffer will be written only 6 times, once per checkpoint.
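Tomas’s example can be sketched numerically. This is a hypothetical simulation for illustration, not PostgreSQL internals: it assumes one modification per minute over an hour, and that a checkpoint writes a dirty buffer exactly once.

```python
# Toy model: count how often a repeatedly-dirtied buffer hits disk
# under different checkpoint intervals (all parameters hypothetical).

def buffer_writes(run_minutes, modify_every, checkpoint_every):
    """Return how many times the buffer is flushed to disk."""
    writes = 0
    dirty = False
    for minute in range(1, run_minutes + 1):
        if minute % modify_every == 0:
            dirty = True                 # a row is inserted into the block
        if minute % checkpoint_every == 0 and dirty:
            writes += 1                  # the checkpoint flushes the buffer
            dirty = False
    return writes

# Checkpoint every minute: the block is written on every checkpoint (60x).
assert buffer_writes(60, 1, 1) == 60
# Checkpoint every 10 minutes: the 60 inserts consolidate into 6 writes.
assert buffer_writes(60, 1, 10) == 6
```

The WAL volume is the same either way; only the number of data-file writes changes, which is the effect being debated here.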

slide 57: Since 8.1, any of the index columns can be used, although it’s not as effective as using the leading ones. See this

slide 62: I wouldn’t say using a database as a filesystem is a stupid trick. There are cases where having an ACID compliant transactional file system is great. The same is true for job queues.

slide 63: It might be handy to mention how wal_level=minimal can be used with COPY to lower the amount of WAL.

slide 64: There are known issues with prepared statements, but I wouldn’t say the common situation is a total loss.

Xof at 12:11, 26 January 2012:

Thanks for taking the time to comment!

Slide 35 — Good catch; I’ll fix it in future versions.

Slide 39 — That’s true, but I’d say that is a relatively small effect, except for some pathological cases; the WAL activity will be a much more significant factor.

Slide 57 — The slide might be unclear. A multicolumn index on (a, b, c) will not be useful for queries that involve b and c but do not involve a, except under very unusual circumstances, such as a highly restrictive partial index.
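The intuition can be sketched with a toy model (an assumption for illustration, not PostgreSQL’s actual btree): treat the index as a list of tuples sorted by (a, b, c). A condition on the leading column a can binary-search; a condition on b alone cannot, and degenerates into inspecting every entry.

```python
import bisect

# Hypothetical index on (a, b, c): 100 * 10 * 2 = 2000 entries,
# kept in (a, b, c) sort order like a btree's leaf level.
index = sorted((a, b, c) for a in range(100)
                         for b in range(10)
                         for c in range(2))

def lookup_by_a(a):
    """Leading-column condition: binary search narrows to a small range."""
    lo = bisect.bisect_left(index, (a,))
    hi = bisect.bisect_left(index, (a + 1,))
    return index[lo:hi]

def lookup_by_b(b):
    """No leading-column condition: every entry must be checked."""
    return [t for t in index if t[1] == b]

assert len(lookup_by_a(42)) == 20    # 10 b-values * 2 c-values
assert len(lookup_by_b(7)) == 200    # 100 a-values * 2 c-values
```

This is why “any column can be used” (as a filter while scanning the index) is true since 8.1, yet far less effective than a condition on the leading column, which actually bounds the scan.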

Slide 62 — There are edge cases where using the DB as a file system makes sense, but they’re just that: edge cases. In the vast majority of cases, it’s much better to store the large object on the file system and keep a reference to it in the database. In the case of task queues, most task queue managers have terrible operational characteristics (such as polling the queue on a very fast duty cycle).

Slide 63 — I’ve never noticed any real-life speedup from that, but it certainly can’t hurt.

Slide 64 — Prepared statements need a whole separate discussion, but quickly:

(a) The most common case is that the driver does a PARSE / BIND / EXECUTE on the unnamed prepared statement. This simply breaks a single operation up into three with no benefit.

(b) It’s not automatically true that large, expensive-to-parse statements benefit from being turned into prepared statements. The reduction in parse time has to be balanced against the fact that the statement will be planned only once, against generic rather than specific values, so the actual plan may be suboptimal, potentially very suboptimal, for the actual parameters being executed. It’s not a sure win.

(c) The one case in which a prepared statement is 100% guaranteed to be better is a large number of identical (except for values) INSERTs.

Thanks again for your comment!

Gregory Smith at 15:30, 31 January 2012:

Since it’s inevitable that all corrections will themselves need corrections, I’ll add that since PostgreSQL 8.3 the amount of disk space used by the pg_xlog WAL is normally bounded by (2 + checkpoint_completion_target) * checkpoint_segments + 1 segment files of 16MB each. That is effectively an upper bound of 3 * checkpoint_segments in most cases. That bound no longer holds once archiving fails, though, so in practice this is not the important part. Making sure you notice when archiving/replication fails is.
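Greg’s bound is easy to turn into a back-of-envelope sizing check. This assumes the pre-9.5 settings he is describing (16MB WAL segments, the old checkpoint_segments parameter); the numbers below are illustrative, not recommendations.

```python
# pg_xlog size (MB) normally bounded by:
#   ((2 + checkpoint_completion_target) * checkpoint_segments + 1) * 16MB

def wal_bound_mb(checkpoint_segments, checkpoint_completion_target,
                 segment_mb=16):
    files = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    return files * segment_mb

# Old default settings: checkpoint_segments = 3, completion_target = 0.5
assert wal_bound_mb(3, 0.5) == 136.0              # (2.5 * 3 + 1) * 16

# A hypothetical tuned server: checkpoint_segments = 64, target = 0.9
assert abs(wal_bound_mb(64, 0.9) - 2985.6) < 1e-6  # roughly 3GB of pg_xlog
```

As the comment says, the bound only holds while archiving is healthy; if archive_command starts failing, pg_xlog grows without limit until the disk fills.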

The checkpoint parameters do alter the total amount of I/O done by the system. The most commonly accessed pages in memory will only be written once per checkpoint. As you get into gigabytes of shared_buffers, altering the time between checkpoints can have a large and measurable impact on the total amount of writes. You can extract snapshots from pg_stat_bgwriter to prove that, and I can normally shave at least an average of 1MB/s of random I/O off a server by tuning here. That doesn’t sound like much, but the sort of completely random I/O that checkpoint writes do is pretty expensive. And despite the best attempts so far, they are still clustered closer together than we’d like.

I’m in complete agreement with Christophe on prepared statements though. The possibility of query plan regressions is much higher than the odds you’ll see a performance gain from doing less query planning. Eliminating them is one of the most common application level fixes I end up applying. The ability of the query planner to identify MCVs and estimate selectivity is gone once you’re preparing, and that’s a big loss.