The Build

19:52

Only NATURAL: JOIN clauses vs WHERE conditions

15 October 2009

When I learned SQL, I was taught that the way to do a join was with a WHERE clause. For example, if you have orders:

CREATE TABLE orders (
   order_id bigserial PRIMARY KEY,
   customer_id bigint REFERENCES customers(customer_id),
   ...
)

and order lines:

CREATE TABLE order_lines (
   ordinal integer NOT NULL,
   order_id bigint REFERENCES orders(order_id),
   item_id bigint REFERENCES items(item_id),
   quantity decimal(10,2) NOT NULL CHECK(VALUE >= 0),
   ...
   PRIMARY KEY(order_id, ordinal)
)

… to get a list of customer_id, item_id pairs:

SELECT orders.customer_id, order_lines.item_id
   FROM orders, order_lines
   WHERE orders.order_id = order_lines.order_id;

And that works great. However, there is another syntax, which is to use a JOIN clause in the FROM list:

SELECT orders.customer_id, order_lines.item_id
   FROM orders NATURAL JOIN order_lines;

NATURAL JOIN, in the words of the wonderful PostgreSQL documentation, “is shorthand for a USING list that mentions all columns in the two tables that have the same names,” so it’s equivalent to:

SELECT orders.customer_id, order_lines.item_id
   FROM orders JOIN order_lines USING(order_id);

which in turn is equivalent to:

SELECT orders.customer_id, order_lines.item_id
   FROM orders JOIN order_lines ON orders.order_id=order_lines.order_id;

NATURAL JOIN is nifty, but it does have a signficant danger. Suppose later we decided to add notes to orders:

ALTER TABLE orders ADD COLUMN note text;

And to order lines, too:

ALTER TABLE order_lines ADD COLUMN note text;

Suddenly, our natural join is much less, well, natural; it’s equivalent to:

SELECT orders.customer_id, order_lines.item_id
   FROM orders JOIN order_lines ON (orders.order_id=order_lines.order_id)
      AND (orders.note=order_lines.note);

… which is almost certainly not what you want. In absence of amazingly good naming discipline, NATURAL is a bit too surprising.

That being said, I’ve gotten very fond of USING and ON. I find it very intuitive to separate join conditions from other filtering conditions. For example, if we only wanted order lines with a quantity greater than zero:

SELECT orders.customer_id, order_lines.item_id
   FROM orders JOIN order_lines USING(order_id)
   WHERE order_lines.quantity > 0;

I like the separation of the join condition that assembles the product of the input tables from the filtration conditions, as I tend to think of those independently when designing a query.

This starts coming apart when you have extremely complicated join conditions, or are joining over many tables (more than three or four), it’s often clearer to move the join conditions into the WHERE clause. Otherwise, see if separating out the join conditions from the other conditions makes your queries easier to read; it did for me.

melendezje at 14:14, 16 October 2009:

Excelent, until today i couln’t find another better way to explain that…

Eric Howe at 15:18, 16 October 2009:

I tend to stuff all my join conditions in the WHERE clause unless I have to put them in FROM (i.e. I’m doing some LEFT OUTER JOINS). The only thing I miss from using Sybase (ack! ptooi!) is their convenient LEFT OUTER JOIN syntax for WHERE clauses: “WHERE A.a *= B.b” means “A LEFT OUTER JOIN B ON (A.a = B.b)”. The full JOIN syntaxes seem overly verbose to me.

Xof at 15:33, 16 October 2009:

> The full JOIN syntaxes seem overly verbose to me.

It is entirely a matter of taste, of course. For me, I like having the join conditions segregated out, as I think of them as logically distinct from the “filter” conditions in WHERE. (Of course, from a relational algebra perspective, they’re all the same kind of thing.)

Olly at 15:56, 18 October 2009:

Nice post, but I think you should also try to always specify what type of join you want to use. Makes reading easier, and thats something that the comma-joining doesn’t do.

gregj at 02:12, 19 October 2009:

there might be huge difference in performance, if you move some of the conditions from WHERE into USING, especially with postgresql. Try it someday ;)
Other than that, I do agree with author. Natural joins are evil.

Jeff Davis at 13:28, 19 October 2009:

NATURAL JOIN is great for ad-hoc queries if you have a good and consistent naming scheme. One huge benefit is that it’s an infix operator between two tables, which is very easy to read. JOIN … USING is not prefix, postfix, or infix — it’s a mix of infix and postfix, which is much more awkward.

Jeff Davis at 13:30, 19 October 2009:

gregj,

Are you sure about the performance difference? Can you show an example in 8.4?

CN at 20:09, 24 October 2009:

> there might be huge difference in performance, if you move some of the conditions from WHERE into USING, especially with postgresql.

I would like to know more about this, too.

The PostgreSQL manual reads:

“where condition is any expression that evaluates to a result of type boolean. Any row that does not satisfy this condition will be eliminated from the output.”

This part of documentation gives me the impression that WHERE collects “all” rows from all the selected tables and discard the unwanted rows at the final step. Is my understanding correct? Does this also imply that “JOIN t1 ON” or “JOIN t1 USING” skip the unwanted rows earlier than “WHERE” does?