postgresql when it's not your job

15:54

Extra columns when doing .distinct() in a Django QuerySet

22 December 2010

tl;dr: If you are doing a .distinct() query and limiting the results using .values() or .values_list(), you may be in for a surprise if your model has a default ordering using the Meta value ordering. You probably want to clear the ordering using .order_by() with no parameters.


If a model is ordered, either by .order_by() on the QuerySet or a Meta ordering value, it will always include that field in the QuerySet. This is true even if the query uses .distinct(). To quote the documentation:

Any fields used in anorder_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct().

(The documentation as written implies that is only problem with related models, but as we’ll see, it’s a problem in general. A documentation patch is probably in order here.)

By way of illustration, let’s assume you have the following models:

from django.db import models

class Publisher(models.Model):
    name = models.TextField()

    class Meta:
        ordering = [ 'name', ]

class Book(models.Model):
    title = models.TextField()
    topic = models.TextField()
    publisher = models.ForeignKey(Publisher)

    class Meta:
        ordering = [ 'title', ]

And we create some rows:

pub = Publisher(name="Strange But True Publications")
pub.save()

And some books:

book1 = Book(title="New Topics in Industrial Meringue Production",
             topic="Cooking",
             publisher=pub)
book1.save()

book2 = Book(title="Your Chicken's First Song Book",
             topic="Animal Husbandry",
             publisher=pub)
book2.save()

Now, we want to get the list of IDs of the publishers, and we’re using the cool optimization that I described earlier, with the optimization a commenter suggested (thanks!):

>>> q = Book.objects.values_list('publisher_id', flat=True).distinct()
>>> print q
[1, 1]

Um, wait. That’s not right. Why would it return 1 twice when we said .distinct()? Let’s look at the SQL (you are doing a tail -f on the PostgreSQL logs while you develop, right?):

LOG:  statement: SELECT DISTINCT "x_book"."publisher_id", "x_book"."title" FROM "x_book" ORDER BY "x_book"."title" ASC LIMIT 21

And there we have it. It includes the title field in the query, even though it doesn’t return it. Since the DISTINCT thus applies to both, we have two distinct rows, rather than one.

The fix, fortunately, is easy; just clear the ordering with a .order_by() without any parameters:

>>> q = Book.objects.values_list('publisher_id', flat=True).distinct().order_by()
>>> print q
[1]

And the query:

LOG:  statement: SELECT DISTINCT "x_book"."publisher_id" FROM "x_book" LIMIT 21

Adrian Klaver at 09:40, 23 December 2010:

I appreciate this Django series. I am in the process of learning the framework and the tips are proving invaluable. Keep them coming.

Aamir Adnan at 00:54, 27 March 2011:

Excellent solution.. i have spent two days on resolving this issue but all in vain, most of the sites did explain this problem even the documentation also explain this issue but none of them provide the solution.. :) thanks you saved my 3rd day :p

Phoebe Bright at 05:56, 6 July 2011:

Great tip – saved me a lot of grief!

Thanks.