15:54
Extra columns when doing .distinct() in a Django QuerySet
22 December 2010
tl;dr: If you are doing a .distinct()
query and limiting the results using .values()
or .values_list()
, you may be in for a surprise if your model has a default ordering using the Meta value ordering
. You probably want to clear the ordering using .order_by()
with no parameters.
If a model is ordered, either by .order_by()
on the QuerySet or a Meta ordering
value, it will always include that field in the QuerySet. This is true even if the query uses .distinct()
. To quote the documentation:
Any fields used in an
order_by()
call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction withdistinct()
.
(The documentation as written implies that is only problem with related models, but as we’ll see, it’s a problem in general. A documentation patch is probably in order here.)
By way of illustration, let’s assume you have the following models:
from django.db import models
class Publisher(models.Model):
name = models.TextField()
class Meta:
ordering = [ 'name', ]
class Book(models.Model):
title = models.TextField()
topic = models.TextField()
publisher = models.ForeignKey(Publisher)
class Meta:
ordering = [ 'title', ]
And we create some rows:
pub = Publisher(name="Strange But True Publications")
pub.save()
And some books:
book1 = Book(title="New Topics in Industrial Meringue Production",
topic="Cooking",
publisher=pub)
book1.save()
book2 = Book(title="Your Chicken's First Song Book",
topic="Animal Husbandry",
publisher=pub)
book2.save()
Now, we want to get the list of IDs of the publishers, and we’re using the cool optimization that I described earlier, with the optimization a commenter suggested (thanks!):
>>> q = Book.objects.values_list('publisher_id', flat=True).distinct()
>>> print q
[1, 1]
Um, wait. That’s not right. Why would it return 1
twice when we said .distinct()
? Let’s look at the SQL (you are doing a tail -f
on the PostgreSQL logs while you develop, right?):
LOG: statement: SELECT DISTINCT "x_book"."publisher_id", "x_book"."title" FROM "x_book" ORDER BY "x_book"."title" ASC LIMIT 21
And there we have it. It includes the title
field in the query, even though it doesn’t return it. Since the DISTINCT
thus applies to both, we have two distinct rows, rather than one.
The fix, fortunately, is easy; just clear the ordering with a .order_by()
without any parameters:
>>> q = Book.objects.values_list('publisher_id', flat=True).distinct().order_by()
>>> print q
[1]
And the query:
LOG: statement: SELECT DISTINCT "x_book"."publisher_id" FROM "x_book" LIMIT 21
There are 3 comments.
Adrian Klaver at 09:40, 23 December 2010:
I appreciate this Django series. I am in the process of learning the framework and the tips are proving invaluable. Keep them coming.
Aamir Adnan at 00:54, 27 March 2011:
Excellent solution.. i have spent two days on resolving this issue but all in vain, most of the sites did explain this problem even the documentation also explain this issue but none of them provide the solution.. :) thanks you saved my 3rd day :p
Phoebe Bright at 05:56, 6 July 2011:
Great tip – saved me a lot of grief!
Thanks.