14 December 2010
This is a follow-up to the previous post, in which we talked about ways of handling huge result sets in Django.
Two commenters (thanks!) pointed out that psycopg2 has built-in support for server-side cursors, using the
name option on the .cursor() function.
To use this in Django requires a couple of small gyrations.
First, Django wraps the actual database connection inside of the django.db.connection object, as the property connection. So, to create a named cursor, you need:
cursor = django.db.connection.connection.cursor(name='gigantic_cursor')
If this is the first call you are making against that connection wrapper object, it’ll fail; the underlying database connection is created lazily. As a rather hacky solution, you can do this:
from django.db import connection

if connection.connection is None:
    # This is required to populate the connection object properly
    cursor = connection.cursor()

cursor = connection.connection.cursor(name='gigantic_cursor')
You can then iterate over the results using the standard iterator or the cursor.fetchmany() method, and that will pull results from the server in appropriately sized chunks.
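As a minimal sketch of the fetchmany() approach: the generator below pulls rows in chunks from any DB-API cursor and yields them one at a time. The helper name iter_rows and the chunk size are illustrative assumptions, not part of psycopg2 or Django. An in-memory SQLite table stands in for PostgreSQL here so the example runs on its own; with psycopg2, you would pass in the named cursor after calling execute() on it.

```python
import sqlite3


def iter_rows(cursor, chunk_size=1000):
    """Yield rows one at a time, fetching them from the cursor in chunks.

    Hypothetical helper: works with any DB-API cursor that has fetchmany().
    """
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in for the named psycopg2 cursor: an in-memory SQLite table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE numbers (n INTEGER)")
demo_conn.executemany(
    "INSERT INTO numbers (n) VALUES (?)", [(i,) for i in range(2500)]
)

demo_cursor = demo_conn.cursor()
demo_cursor.execute("SELECT n FROM numbers ORDER BY n")

# Only one chunk of rows is held by the generator at any time.
total = sum(row[0] for row in iter_rows(demo_cursor, chunk_size=1000))
```

Only one chunk is held in Python at a time, which is what keeps memory usage flat; the server-side cursor is what stops the driver from loading the whole result set up front.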
13 December 2010
tl;dr: Don’t use Django to manage queries that have very large result sets. If you must, be sure you understand how to keep memory usage manageable.
6 November 2010
Christmas just came early for me. psycopg2.3, now in beta, includes named tuples as return values from queries.
If you are tired of indexing into each result row by position, and would much prefer to write result.column_name, you now can.
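To show what attribute-style access looks like, here is a stand-alone sketch that builds named tuples from a cursor's column descriptions. It is not psycopg2's implementation: in psycopg2 the feature is exposed through psycopg2.extras.NamedTupleCursor, while the helper fetch_named and the users table below are hypothetical, and SQLite is used only so the example runs without a PostgreSQL server.

```python
import sqlite3
from collections import namedtuple


def fetch_named(cursor):
    """Return all remaining rows as namedtuples keyed by column name.

    Hypothetical helper illustrating the idea; column names must be
    valid Python identifiers for namedtuple to accept them.
    """
    Row = namedtuple("Row", [col[0] for col in cursor.description])
    return [Row(*values) for values in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

cur = conn.cursor()
cur.execute("SELECT id, email FROM users")
rows = fetch_named(cur)

first = rows[0]
# Positional and attribute access refer to the same value.
same_value = first[1] == first.email
```

Because a named tuple is still a tuple, existing code that indexes by position keeps working alongside the new attribute access.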
28 October 2010
Yesterday, I commented on a post about how widespread uptake on 9.0 replication will be. I disagreed with the assessment that “users” (by which we mean small installations of PostgreSQL, defined however you care to) will not be interested in 9.0’s hot standby/streaming replication.
Ultimately, of course, we’ll find out. But I strongly feel that 9.0’s streaming replication will be a big deal for small PostgreSQL installations… indeed, I think it will be a much bigger deal for them than big ones.
First, I’ll happily exclude hobbyist and developer installs of PostgreSQL. I don’t back up my development PG databases more often than once a day, and I certainly don’t have any kind of replication set up for them (unless that’s what I’m developing). The important part, the code, lives in a DVCS, and if I had to reconstruct the db from scratch, no big deal… indeed, I do it all the time.
I’m talking about small installations of PG that are used as authoritative records of business-critical information: Web site transactions, for example. The fact that, traditionally, these users of PG haven’t been all that into replication solutions has nothing to do with their actual need for replication; instead, it has to do with the solutions they had available.
- Small installations generally don’t have the time and expertise to search out third-party solutions, or the budget to pay an expert to do so. If it doesn’t come in the base RPM or tarball, they’re not interested in it.
- The third-party solutions that are available are all complex and fiddly to set up. I’m certainly not bashing Slony, for example; it’s a great tool. But it is not something that a casual DBA wants to take on.
So, they make do with pg_dumpall and hope for the best… and then call someone like us if that doesn’t work.
But it is fallacious to conclude that because they are not using replication right now, they have no use for it. Ask a corner liquor store if they could afford to have an entire day’s worth of electronic transactions just vanish; I’ll bet a bottle of something cheap that they carry that the answer would be, “Of course not.” It might not be worth a $15,000 consulting engagement to set it up, but it’s worth something, possibly quite a bit.
Indeed, this is one of the things that’s driving adoption of “cloud computing”: The (sometimes erroneous) idea that the cloud provider is managing disaster recovery and high availability for you, included in the cost of your monthly service charge.
tl;dr: PG 9.0’s streaming replication will be widely adopted by smaller installations that use PG to manage business-critical data, specifically because it makes it something a casual DBA can do, something we’ve not had before with PG.
27 October 2010
Over at the Command Prompt blog, Joshua Drake makes a (probably deliberately) provocative point about “users” not wanting replication, as opposed to “customers” who do. I’ll confess I’m not 100% sure about his distinction between “users” and “customers,” so I’ll just make something up: Users are the people sitting in front of the application, entering data, buying shoes, or doing whatever it is that the database enables; customers are the CIOs, CTOs, Directors of Engineering, and the other people who make purchasing decisions.
Yes, Command Prompt customers want replication. Yes, PostgreSQL Experts, EntepriseDB and OmniTI customers want replication. However, customers are not users. At least not in the community sense and the users in the community, the far majority of them do not need or want replication. A daily backup is more than enough for them.
Well, yes, as far as it goes, he’s absolutely right. Users don’t need or want replication. They don’t need or want PostgreSQL, for that matter; VSAM, flat files, or a magic hamster would be fine with them, too, as long as the data that comes out is the data that goes in.
But for how many users, really, is “It’s OK if you lose today’s data, gone, irretrievably, pffft, yes?” really an acceptable answer? Very few. Very very few, and getting fewer all the time. One of the strongest pushes behind moving services into the “cloud” (i.e., external hosting providers of various kinds) is that they provide near-constant recovery and fault-tolerance. Users don’t care if their data is protected by hardware-level solutions like SANs, or software-level solutions like replication, as long as it is protected.
Users who profess not to care about this are either not putting authoritative data into a database, or just haven’t had the inevitable data disaster happen to them yet.
For me, the biggest feature of PostgreSQL’s 9.0 replication is that it is much, much easier to set up than any previous solution. Slony is a heroic project, and has lots of happy customers using it extensively, but it is notoriously fiddly and complex to set up.
Like a lot of technologies, replication hasn’t been in demand for a lot of PostgreSQL implementations because the cost didn’t seem worth the payoff. 9.0 brings the implementation cost way, way down, and thus, we’ll start seeing a lot more interest in putting replication in.
Of course, do the daily backups, too.
What does “web scale” mean?
It clearly means something along the lines of, “Can handle lots of transactions per unit time,” but how many?
I mean, WordPress with WP-SuperCache is “web scale” if all that is meant is, “Can be used to implement a high volume site,” but I assume those who are touting something as “web scale” are aiming higher than that.
Anyone care to offer a quantitative definition of this term?
25 October 2010
A well-known issue that can come up with Django sites running on PostgreSQL is that connections in “Idle in Transaction” state can pile up. There’s a relatively straightforward fix, but ultimately, it’s due to a bug in Django’s transaction management, at least when PostgreSQL is the back-end.
Let’s run through it.
26 September 2010
If you are having them, the fix is here.
18 July 2010
Here are the slides for my talk at PDXPUG Day at OSCON 2010.
3 June 2010
The slides from my talk, Introduction to PostgreSQL, are available here.