The Build

23:36

When LIMIT attacks

18 November 2014

One common source of query problems in PostgreSQL results an unexpectedly-bad query plan when a LIMIT clause is included in a query. The typical symptom is that PostgreSQL picks an index-based plan that actually takes much, much longer than if a different index, or no index at all, had been used.

Here’s an example. First, we create a simple table and an index on it:

xof=# CREATE TABLE sample (
xof(#   i INTEGER,
xof(#   f FLOAT
xof(# );
CREATE TABLE
xof=# CREATE INDEX ON sample(f);
CREATE INDEX

And fill it with some data:

xof=# INSERT INTO sample SELECT 0, random() FROM generate_series(1, 10000000);
INSERT 0 10000000
xof=# ANALYZE;
ANALYZE

Then, for about 5% of the table, we set i to 1:

xof=# UPDATE sample SET i=1 WHERE f<0.05;
UPDATE 499607
xof=# ANALYZE;
ANALYZE

Now, let’s find all of the entires where i is 1, in descending order of f.

xof=# EXPLAIN ANALYZE SELECT * FROM sample WHERE i=1 ORDER BY f DESC;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=399309.76..401406.04 rows=838509 width=12) (actual time=1415.166..1511.202 rows=499607 loops=1)
   Sort Key: f
   Sort Method: quicksort  Memory: 35708kB
   ->  Seq Scan on sample  (cost=0.00..316811.10 rows=838509 width=12) (actual time=1101.836..1173.262 rows=499607 loops=1)
         Filter: (i = 1)
         Rows Removed by Filter: 9500393
 Total runtime: 1542.529 ms
(7 rows)

So, 1.5 seconds to do a sequential scan on the whole table. So, just getting the first 10 entries from that should be much faster, right?

xof=# EXPLAIN ANALYZE SELECT * FROM sample WHERE i=1 ORDER BY f DESC LIMIT 10;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..277.33 rows=10 width=12) (actual time=12710.612..12710.685 rows=10 loops=1)
   ->  Index Scan Backward using sample_f_idx on sample  (cost=0.43..23218083.52 rows=838509 width=12) (actual time=12710.610..12710.682 rows=10 loops=1)
         Filter: (i = 1)
         Rows Removed by Filter: 9500393
 Total runtime: 12710.714 ms
(5 rows)

Oh. 12.7 seconds. What happened?

PostgreSQL doesn’t keep correlated statistics about columns; each column’s statistics are kept independently. Thus, PostgreSQL made an assumption about the distribution of values of i in the table: they were scattered more or less evenly throughout. Thus, walking the index backwards meant that, to get 10 “hits,” it would have to scan about 100 index entries… and the index scan would be a big win.

It was wrong, however, because all of the i=1 values were clustered right at the beginning. If we reverse the order of the scan, we can see that was a much more efficient plan:

xof=# EXPLAIN ANALYZE SELECT * FROM sample WHERE i=1 ORDER BY f  LIMIT 10;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..277.33 rows=10 width=12) (actual time=0.029..0.046 rows=10 loops=1)
   ->  Index Scan using sample_f_idx on sample  (cost=0.43..23218083.52 rows=838509 width=12) (actual time=0.027..0.044 rows=10 loops=1)
         Filter: (i = 1)
 Total runtime: 0.071 ms
(4 rows)

So, what do we do? There’s no way of telling PostgreSQL directly to pick one plan over the other. We could just turn off index scans for the query, but that could well have bad side effects.

In this particular case, where a predicate (like the WHERE i=1) picks up a relatively small number of entries, we can use a Common Table Expression, or CTE. Here’s the example rewritten using a CTE:

xof=# EXPLAIN ANALYZE
xof-# WITH inner_query AS (
xof(#    SELECT * FROM sample WHERE i=1 
xof(# )
xof-# SELECT * FROM inner_query ORDER BY f  LIMIT 10;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=351701.16..351701.18 rows=10 width=12) (actual time=1371.946..1371.949 rows=10 loops=1)
   CTE inner_query
     ->  Seq Scan on sample  (cost=0.00..316811.10 rows=838509 width=12) (actual time=1168.034..1244.785 rows=499607 loops=1)
           Filter: (i = 1)
           Rows Removed by Filter: 9500393
   ->  Sort  (cost=34890.06..36986.33 rows=838509 width=12) (actual time=1371.944..1371.944 rows=10 loops=1)
         Sort Key: inner_query.f
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on inner_query  (cost=0.00..16770.18 rows=838509 width=12) (actual time=1168.040..1325.496 rows=499607 loops=1)
 Total runtime: 1381.472 ms
(10 rows)

A CTE is an “optimization fence”: The planner is prohibited from pushing the ORDER BY or LIMIT down into the CTE. In this case, that means that it is also prohibited from picking the index scan, and we’re back to the sequential scan.

So, when you see a query come completely apart, and it has a LIMIT clause, check to see if PostgreSQL is guessing wrong about the distribution of data. If the total number of hits before the LIMIT are relatively small, you can often use a CTE to isolate that part, and only apply the LIMIT thereafter. (Of course, you might be better off just doing the LIMIT operation in your application!)

Robert at 04:11, 19 November 2014:

Excellent article! Thank you:)

Victor at 06:57, 19 November 2014:

Could you explain *why* would “all of the i=1 values clustered right at the beginning”?

xocolatl at 08:57, 19 November 2014:

Another technique is to order by an identity, which prohibits it from using the index:

SELECT * FROM sample WHERE i=1 ORDER BY f+0 DESC LIMIT 10;

Xof at 11:10, 19 November 2014:

Could you explain *why* would “all of the i=1 values clustered right at the beginning”?

The data is constructed that way in the example. We set i to 1 for anything lower than a certain value of f, and then retrieve in reverse order based on f.

In real-life data, there could be a correlation between the field in the ORDER BY and the field in the predicate, or just random distribution that worked out badly in the particular case of the query you’re working on.

Kai at 13:04, 19 November 2014:

For examples like these hinting would be really beneficial. And I know the pg philosophy here (“we don’t like hints, we improve the planner instead”).

At least CTEs are a quite natural way to impose subtle pressure towards a good plan, if necessary.

Norman at 10:03, 20 November 2014:

“PostgreSQL doesn’t keep correlated statistics about columns; each column’s statistics are kept independently.”

This statement is absolutely correct; it may be worth pointing out, though, that one big reason Postgres doesn’t do this is that finding a way to do so efficiently is a Very Hard Problem. Other lesser concerns include figuring out which columns’ correlations are worth tracking.