13 December 2010
tl;dr: Don’t use Django to manage queries that have very large result sets. If you must, be sure you understand how to keep memory usage manageable.
One of the great things about modern interpreted, garbage-collected languages is that most of the memory management happens behind the scenes for you. Unfortunately, sometimes, the stage equipment comes crashing through the backdrop in the middle of the performance.
In Django, this frequently happens when manipulating tables that contain a very large number of rows. Here are some tips on how to not end up with the “behind the scenes” machinery landing in the audience’s lap.
For purposes of this discussion, let’s define “very large” as being bigger than is comfortable to keep in memory for the application.
When Django executes a query and reads the results, memory is being taken up several places to hold the results of the query:
On the database server, it needs to keep around structures holding the result of the query. Most database servers are good about not keeping any more rows of the result in memory than they absolutely have to, and in any event, it’s pretty much out of Django’s control what the database server stores. So, we shall trust the database to do the right thing, and move on.
Inside of the Django application, some set of the rows that will ultimately be the result of the query need to be stored while Django processes them.
The Django QuerySet object can (although it does not always) cache some of the results of the query as Model objects.
And, of course, the application might hang on to some of the objects that come back (for example, for display on the web page). Of course, this is directly under the control of the application author. We’ll trust you to do the right thing, and move on.
First, let’s talk about Django’s caching in the QuerySet object.
Query Set Caching
Django’s QuerySet objects serve two roles: They’re data structures representing an SQL query, and an API to access the results of the query. There’s no explicit “do this query now, please” operation in Django (although some operations have that as, shall we say, a strong implication); by and large, Django waits until you try to get the results of a query before executing it. So, until you get the first object out of the query set, Django won’t have even executed the query.
Django also has a caching mechanism built into QuerySet. This cache stores the objects that are manufactured from the rows as they come back from the database, so that multiple accesses to the same object from the same QuerySet will return the cached object instead of a new copy.
Note, however, the emphasis on from the same QuerySet. A surprisingly large number of operations clone the QuerySet before operating on it. For example:

qs = ExampleModel.objects.filter(name='Fred')
x = qs[2]
x = qs[2]

This will do two queries. qs[2], under the hood, clones the query set, applies a limit of [2:3] to it, executes the query, returns the resulting object, and throws the limited QuerySet away. Slicing does exactly the same thing.
However, there is an exception. If you do this:
qs = ExampleModel.objects.filter(name='Fred')
list(qs)
x = qs[2]
x = qs[2]

… the access pattern will be very different. list(qs) forces the evaluation of the query set, so Django will send the query to the database server, and populate the QuerySet (and its cache) with the result. Then, the qs[2] operations don’t copy the QuerySet; they just hit the cache.
Note, though, that this came at the expense of retrieving every row that matched the query from the server, and creating objects for it. If you force the QuerySet to be evaluated, Django creates objects for everything that matches the query.
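The two access patterns above can be modeled with a toy class. This is not Django’s actual code, just a framework-free sketch of the cloning and caching behavior described in the text: indexing an unevaluated query set clones it and executes the clone, while evaluating it (with list(), say) fills a cache that later indexing hits without a new query.

```python
class ToyQuerySet:
    executions = 0  # counts simulated round-trips to the database

    def __init__(self, rows):
        self._rows = rows
        self._cache = None

    def _execute(self):
        ToyQuerySet.executions += 1
        return list(self._rows)

    def __iter__(self):
        if self._cache is None:          # evaluation populates the cache
            self._cache = self._execute()
        return iter(self._cache)

    def __getitem__(self, i):
        if self._cache is not None:      # already evaluated: hit the cache
            return self._cache[i]
        clone = ToyQuerySet(self._rows)  # clone, apply the limit,
        return clone._execute()[i]       # execute, throw the clone away
```

With this toy, indexing twice before evaluation costs two “queries,” but after list() has run, indexing costs nothing further, which is exactly the difference between the two snippets above.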
When you iterate over a QuerySet, the behavior is slightly different. The QuerySet cache is always built from the first object that matches the query on up; it’s not sparse (for example, you’ll never have the situation where qs[1] and qs[5] are in the cache, but the objects between them aren’t). As you iterate over a QuerySet, if the cache is not already populated, Django grabs the rows in chunks (currently hard-coded to be 100) and fills the cache ahead of the iterator. This does mean that if you do a query, then only iterate over the first few elements, the cache doesn’t fill up with stuff you are never going to look at.
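That chunk-ahead cache fill can be sketched framework-free, with a fetch function standing in for the real database. The 100-row chunk size mirrors the hard-coded value mentioned above; note that the cache only grows as far as iteration actually gets.

```python
CHUNK_SIZE = 100  # Django's chunk size is likewise hard-coded to 100

class ChunkedCache:
    def __init__(self, fetch_chunk):
        self._fetch_chunk = fetch_chunk  # fetch_chunk(offset, count) -> rows
        self._cache = []
        self._exhausted = False

    def __iter__(self):
        i = 0
        while True:
            if i >= len(self._cache) and not self._exhausted:
                # Refill: grab the next chunk ahead of the iterator.
                chunk = self._fetch_chunk(len(self._cache), CHUNK_SIZE)
                self._cache.extend(chunk)
                if len(chunk) < CHUNK_SIZE:
                    self._exhausted = True
            if i >= len(self._cache):
                return
            yield self._cache[i]
            i += 1
```

If you stop after the first ten elements, only the first chunk of 100 has been fetched and cached; a later full iteration picks up where the cache left off.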
You can defeat the caching by using .iterator(). For example:

qs = ExampleModel.objects.filter(name='Fred')
for x in qs.iterator():
    do_something_wonderful(x)
This will execute the query, and return each resulting object, but without filling the cache. (It also won’t return cached objects if they already exist; .iterator() forces a re-execution of the query.) As the Django documentation says, this can be handy when the result set is huge.
So, let’s say you for some reason want to process 100 million rows. You know for sure that you won’t be able to hold all 100 million Model objects in memory, so you dutifully do:
qs = GiganticTableModel.objects.all()
for giant in qs.iterator():
    # And BANG, you get an out of memory exception right here.
But what happened? Why did you run out of memory before you even saw a single object out of the query set?
Daddy, Where Do Model Objects Come From?
Let’s take a moment and trace down the code path that gets executed here:
Creating the QuerySet doesn’t touch the database at all, as noted above.
After a certain amount of fussing around, .iterator() calls the underlying backend query machinery to perform the query.
The backend machinery executes the query, and creates an iterator over the resulting rows. That iterator (in this case) grabs a chunk of rows at a time using .fetchmany(), and returns them one at a time. (As it happens, that chunk is hardcoded at 100 rows.)
That iterator is called by the actual iterator returned by .iterator(), so the iteration (phew!) proceeds as: call the inner iterator to get a row (which refills via .fetchmany() if the last grab of 100 is exhausted), create an object from that row, and return it to the caller.
So, why are we getting an out of memory condition? Even though there are 100 million rows in the result, there should only be 100 in memory at any one time, right?
Sadly, wrong. At the moment that the backend machinery executes the query, all 100 million rows are returned by the database server at once.
To quote the psycopg2 documentation:
When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned an huge amount of data, a proportionally large amount of memory will be allocated by the client.
This is true even if you do a .fetchmany(), not just a .fetchall(). And there’s no way, while staying entirely within the standard Django QuerySet machinery, to change this behavior.
So, what do we do?
“Doctor, It Hurts When I Do That.”
“So, don’t do that.”
If at all possible, don’t process very large result sets directly in Django. Even setting aside the memory consumption, it’s a horribly inefficient use of pretty much every part of the toolchain. Much more appealing options include:
Use .update() to push the execution into the server.
Use a stored procedure or raw SQL.
Modern database servers are designed to crunch large result sets; leave the data on the server and do it there.
Take Smaller Bites
If there is a way of partitioning the data up into smaller chunks, do that. (For example, processing by day, or ID range.) Although I wouldn’t exactly call it “best practice,” you could iterate through the rows by using ranges of the primary key, assuming a standard Django serial integer PK:
i = 0
while True:
    qs = GiganticTableModel.objects.filter(pk__gte=i*1000, pk__lt=(i+1)*1000)
    if not qs.exists():  # an empty QuerySet doesn't raise DoesNotExist
        break
    for giant in qs:
        do_something_wonderful(giant)
    i += 1
There’s also an example here of constructing an iterator that does much the same thing.
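Such an iterator might be sketched, framework-free, using keyset pagination: remember the last primary key seen and ask for the next batch above it. (With Django this would be roughly .filter(pk__gt=last_pk).order_by('pk') sliced to the batch size; the fetch_batch callable and the assumption of positive integer keys are mine, for illustration.) Unlike fixed pk ranges, this isn’t confused by gaps in the key sequence.

```python
def iterate_by_pk(fetch_batch, batch_size=1000):
    """fetch_batch(last_pk, n) -> up to n (pk, row) pairs with pk > last_pk,
    in ascending pk order. Assumes primary keys are positive integers."""
    last_pk = 0
    while True:
        batch = fetch_batch(last_pk, batch_size)
        if not batch:
            break  # no rows above last_pk: we're done
        for pk, row in batch:
            yield row
        last_pk = batch[-1][0]
```

The loop terminates only when a batch comes back empty, so a million-row gap between pk 11 and pk 4000 costs one extra query rather than ending the iteration early.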
Use a Database-Side Cursor
The way that databases really deal with this problem is cursors. Not the Python DB-API cursor, in this case; server-side cursors are structures that hold the result of a query and allow the client to read portions of it at will, without having the whole thing shipped across.
They’re wonderful, and Django should use them. It doesn’t. However, you can, using direct SQL.
To create a cursor in PostgreSQL in the server, first, we need to have a transaction open. For the full details about Django transaction management, check out some of my earlier blog posts. This is required because the type of cursors we’ll be using will only persist for the duration of the transaction.
Now, the SQL sequence looks something like this. Instead of saying:

SELECT * FROM app_gigantictablemodel;

… we say:

DECLARE gigantic_cursor BINARY CURSOR FOR SELECT * FROM app_gigantictablemodel;

(The BINARY keyword allows it to use the more-efficient binary protocol between the database server and the application.)
Then, to get results, we can just say:
FETCH 1000 FROM gigantic_cursor;
… or however many rows we want to get.
And then, we can just iterate over them (of course, we’re getting the rows as rows, rather than objects):
from django.db import connection

cursor = connection.cursor()
# Remember that this 'cursor' is a different thing than the server-side cursor!
cursor.execute("DECLARE gigantic_cursor BINARY CURSOR FOR SELECT * FROM app_gigantictablemodel")
while True:
    cursor.execute("FETCH 1000 FROM gigantic_cursor")
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        ...
Now, there’s something that should work great, but doesn’t. In 1.2, Django introduced raw SQL queries that return a RawQuerySet. So, one could in theory do this:
qs = GiganticTableModel.objects.raw("FETCH 1000 FROM gigantic_cursor")
Except we get an exception:
Raw queries are limited to SELECT queries. Use connection.cursor directly for other types of queries.
Presumably to guard against weird errors, raw queries do a hardcoded check against the query string, making sure it starts with SELECT. It would be nice if this were liberalized to allow FETCHes.
So, if you must, you can process gigantic result sets in Django. But, ideally, you should design your application to make it unnecessary. If you must process large result sets, do it on the database server; that’s what it’s there for.