ORMs and Their Discontents

20:13, 4 November 2009

I love object-oriented programming. The first time I ever encountered an OO language or framework (Object Pascal and MacApp, thank you for asking), it was as if the heavens opened. It simply made sense to me as a way of structuring programs, and I’ve been wedded to it ever since. (As it turns out, if you do the right paperwork, you can marry a concept.)

So, I think object-oriented programming is the bee’s knees. And, in this post, I’m going to tell you not to use it.

I was recently working on a Django application, using PostgreSQL as the database. (This application has been heavily anonymized.) Among many other things, it has a table that is, in essence, a workflow queue. The queue is a list of orders, identified by their (integer) primary key, order_pk. Periodically, a task comes through, reads the queue into a Python list, and runs down it, calling a method on each object:

for queue_entry in order_queue:
    order = Orders.objects.get(pk=queue_entry.order_pk)
    order.allocate_inventory()

What does allocate_inventory() do? It calls a stored procedure, and returns:

def allocate_inventory(self):
    cursor = connection.cursor()
    cursor.execute("""
        SELECT allocate_inventory(%s)
    """, [self.pk])

Lovely! Let’s just look at the database traffic it creates and… oh, gross.

The program is reading in and instantiating an entire order object just so it can get the primary key out of it. But it was only able to instantiate the object in the first place because it already had the primary key. In this application, order objects have lots and lots of fields, so it’s dragging all of those fields across the wire to build the object, only to use the one field we already knew before we even created the object.

If everything were in memory, no problem. Instantiating a Python object isn’t free, but it’s no big deal, and it keeps everything nicely encapsulated. But this is one of the dangers of ORMs: things that would be cheap in an in-memory world suddenly become much pricier when you have to go out to the database, and it can be quite non-obvious when you are going out to the database, how many times, and in what particular way.
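
To see this concretely, here’s a minimal sketch using Django’s connection.queries list (a real API, though it is only populated when DEBUG is on; the Orders model is from this post, and the primary key value is made up):

from django.db import connection, reset_queries

reset_queries()
order = Orders.objects.get(pk=42)   # one SELECT that drags back every column of the row
order.allocate_inventory()          # a second round trip for the stored procedure
for query in connection.queries:
    print(query["sql"])             # the statements Django actually sent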

(In one particularly pathological example, in this same application, the Django ORM was doing seven calls to the database in order to populate one pop-up menu … which had exactly one entry in it.)

The fix, of course, is to pass the primary key around rather than the object:

for queue_entry in order_queue:
    Orders.allocate_inventory(queue_entry.order_pk)

@staticmethod
def allocate_inventory(order_pk):
    cursor = connection.cursor()
    cursor.execute("""
        SELECT orders.allocate_inventory(%s)
    """, [order_pk])

… even though this isn’t nearly as clean from an OO perspective as creating the object.

Whenever you’re developing against any kind of ORM (or, really, doing any database application programming at all), it’s very wise to set the logging on the development PostgreSQL server to “log every statement,” and keep a tail -f of the log file going in another window. That way, if an operation that looks like the proverbial “just a few lines of code” generates an unexpectedly large number of database calls, you can address that before it falls over in production.
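
On the PostgreSQL side, that looks something like this (a sketch: log_statement and log_min_duration_statement are the actual configuration parameters; the log file path is an example and varies by installation):

# postgresql.conf on the development server -- far too verbose for production
log_statement = 'all'               # log every statement the server receives
log_min_duration_statement = 0      # also log each statement's duration

Then, in another terminal:

tail -f /usr/local/pgsql/data/pg_log/postgresql.log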

ORMs can be very powerful, but the typical usage patterns in a nicely structured OO program can interact with them very badly, and generate a lot more database traffic than you’d expect.

Bill Karwin at 21:38, 4 November 2009:

Yes! Kudos for pointing out the disparity in cost between instantiating a plain object versus populating an object from a database. This is one of the hidden costs of using an ORM framework. Thanks for the good blog post.

Marco Mariani at 03:53, 5 November 2009:

More generally — whenever you work with an abstraction layer, it is wise to know what’s happening one or two layers below that.
For a report that is generated once per week, we can accept the cost of fifty queries generated with three lines of code.
Otherwise, it’s good to have a tool that analyzes the logs and reports “this SELECT * FROM users has been repeated 5 billion times today”. Some query patterns are obvious bugs, btw.
Let me recommend the book “Refactoring SQL Applications” and the One True SQLAlchemy :)

Joe at 06:22, 5 November 2009:

Or you could use SQLAlchemy as the ORM, which gives you the option to “echo” the SQL statements that it’s sending to the DBMS (and it has other nice things, like being able to control the object “loading” strategy).
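
A minimal sketch of the echo option Joe mentions (the connection URL is a placeholder):

from sqlalchemy import create_engine

# echo=True makes the engine log every SQL statement it emits
engine = create_engine("postgresql://localhost/appdb", echo=True)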

myeisha at 14:19, 5 November 2009:

A well-written article, yet the point itself is void. While I do not know all ORMs available, the ones I actively use provide a high-level interface similar to Orders.objects – which I assume to load all orders from the database – and a near-direct SQL access layer to execute “sanitized” queries (sane from an OO point of view, such as Hibernate’s “select order.pk from Order order”, which will return only a list of all orders’ primary keys, *without* instantiating any objects at all, except the list of course).

Disclaimer: I have written a few viable and actively used ORMs, and I use them happily and heavily with no noticeable performance overhead in applications.

Xof at 14:42, 5 November 2009:

> Orders.objects – which I assume to load all orders from the database

In the Django ORM, Model.objects provides (among other things) the API to the query interface. It’s only retrieving the single object.

But that’s not the point of the article, of course. Even if the app was only retrieving the primary key of the one Order object (which Django will let you do), that’s still a completely redundant database hit.
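
A sketch of what that pk-only retrieval looks like (values_list is the real Django API; one query returning a single column, with no model objects instantiated):

pks = Orders.objects.values_list('pk', flat=True)   # SELECT just the pk column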

Robert at 11:43, 6 November 2009:

I think your article brings up a good point, and in particular I agree with one of the other reader comments: you usually need to understand a layer or so beneath an abstraction layer to make sure it’s not doing anything silly.

So, if all you want is a PK… and you have it already, then no need to do anything… I mean AT ALL.

However, if, as you’ve said, you really want to iterate over a list “of objects” and send each a message, then you have no choice but to instantiate an object; otherwise, to what do you send your message?

In such a case you want to get the objects as efficiently as possible (ideally one RPC to get the entire list). Whether or not you only need certain attribute access (data fields) and not others is likely of little concern compared to avoiding all the RPCs.
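
In Django terms, that “one RPC” approach might be sketched like this (in_bulk is a real queryset method; Orders and order_queue come from the post, and this assumes the original instance-method version of allocate_inventory):

pks = [entry.order_pk for entry in order_queue]
orders_by_pk = Orders.objects.in_bulk(pks)   # a single SELECT ... WHERE pk IN (...)
for pk in pks:
    orders_by_pk[pk].allocate_inventory()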

Typically, the only time you need to be concerned about “dragging the attributes of an object” to instantiate it is if you are dealing with millions of them, or if you have some attributes which are especially large… like an image field. You solve these issues in different ways… encapsulating things like images in their own object instead of embedding them, using another layer of indirection in your model to narrow your items of interest in the list, etc.

These kinds of issues are addressed especially well by an object database like db4o or Versant, because you are inherently moving objects around and must do it efficiently. There is zero mapping: your domain model is your database schema.

German at 02:25, 9 November 2009:

I would like to point out that, even with happy users like myeisha, ORMs still have well-identified issues that you can circumvent with object databases. I recommend this article:
http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx
After all, isn’t it obvious that object-based persistence is the ideal way to deal with our objects? The way in which I measure suitability is by determining to what degree my persistence framework needs to change my object representation (more changes, more work), and by this measure there’s nothing better than object databases (not even cloud-based systems like BigTable).

Best!