In the comments to my post about Postgres lateral joins, someone asked about getting the same results with better performance using a GROUP BY query. Actually you can’t get the same results with a straight GROUP BY clause, but there are three more ways to write the query that are worth considering. So I’d like to talk first about why (top-level) GROUP BY doesn’t work, and then look at the other ways to do it (including a subquery GROUP BY).
Recall that the problem was: for each restaurant, show its info, plus the score of its most recent inspection (if any), plus the number of violations from that inspection. When you hear “most recent” it’s natural to think about GROUP BY. Maybe you want to write something like this:
SELECT r.id, r.name, MAX(i.inspected_at)
FROM restaurants r
LEFT OUTER JOIN inspections i
ON i.restaurant_id = r.id
GROUP BY r.id
;
The problem is that whenever you want to pick a row using one column but then select additional columns from that same row, GROUP BY lets you down. In the above query, you can’t add i.score to the SELECT list, because that’s an error. You have to do something fancier.
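To see the failure concretely, here is the same query with the score column added. This is the error Postgres reports for a non-grouped, non-aggregated column:

```sql
-- i.score is neither grouped nor aggregated, so Postgres rejects this:
SELECT r.id, r.name, MAX(i.inspected_at), i.score
FROM restaurants r
LEFT OUTER JOIN inspections i
ON i.restaurant_id = r.id
GROUP BY r.id
;
-- ERROR:  column "i.score" must appear in the GROUP BY clause
--         or be used in an aggregate function
```

Wrapping i.score in MAX() would make the error go away, but it wouldn’t give the right answer: you’d get the highest score overall, not the score from the most recent inspection.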
One way to get there is a lateral join, as we’ve seen already. Three other ways are:
Use GROUP BY in a subquery to get the latest inspection date, then use that date to get the rest of the inspection info.
Use a window function.
Use DISTINCT ON.
A couple things to keep in mind:
We want to show all the restaurants, even if they have no inspections yet. That’s why I didn’t write the above query FROM inspections GROUP BY restaurant_id, and that’s why all our queries below are FROM restaurants with an outer join to something else.
We also want to include a count of the violations from whatever inspection we choose. I’ve left that out of all my queries because it mostly just adds noise, but remember that whatever solution we consider must be compatible with that requirement: it must be possible to add one more outer join to violations and get the count. One approach this rules out is putting a whole subquery into the SELECT list, because that won’t be available for joining to violations.
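To make that requirement concrete, here is a hedged sketch of adding the count to the lateral join from the previous post. It assumes violations has id and inspection_id columns; neither table definition appears in this post, so adjust the names to your schema:

```sql
-- Sketch: latest inspection per restaurant, plus its violation count.
-- Assumes violations(id, inspection_id, ...), which isn't shown here.
SELECT r.id, r.name, i.score, i.inspected_at,
       COUNT(v.id) AS violation_count
FROM restaurants r
LEFT OUTER JOIN LATERAL (
    SELECT *
    FROM inspections
    WHERE inspections.restaurant_id = r.id
    ORDER BY inspected_at DESC
    LIMIT 1) i
ON true
LEFT OUTER JOIN violations v
ON v.inspection_id = i.id
GROUP BY r.id, r.name, i.score, i.inspected_at
;
```

Note that COUNT(v.id) correctly yields 0 for restaurants with no violations (or no inspections), since COUNT ignores the NULLs produced by the outer join.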
For the first alternative, we might write the query like this:
SELECT r.id, r.name, i.score, i.inspected_at
FROM restaurants r
LEFT OUTER JOIN (
    SELECT restaurant_id, MAX(inspected_at) AS max_inspection_date
    FROM inspections
    GROUP BY restaurant_id) i2
ON i2.restaurant_id = r.id
LEFT OUTER JOIN inspections i
ON i.restaurant_id = r.id
AND i.inspected_at = i2.max_inspection_date
;
Here we’re using GROUP BY to get the most recent inspection of each restaurant, and then we add a separate join based on that inspection date to get the rest of the inspection info.
I think this is borderline incorrect. There’s no guarantee that inspected_at is unique: if we have millisecond precision it might be, but what if it’s just a date? If a restaurant did get inspected twice on its latest date, the second join would match both rows and the restaurant would appear twice in the results. Probably no restaurant gets inspected twice in the same day, but it’d be nice if we didn’t have to rely on that assumption. It’s also a bad approach pedagogically, because who knows whether your application will have the same guarantees? I’d rather find a pattern that is always valid, not just for this restaurants example.
Also it’s hard to believe this approach is very fast. Indeed, here is the query plan:
Hash Right Join  (cost=1162.54..2645.04 rows=1000 width=20)
  Hash Cond: ((i.restaurant_id = r.id) AND (i.inspected_at = (max(inspections.inspected_at))))
  ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
  ->  Hash  (cost=1147.54..1147.54 rows=1000 width=16)
        ->  Hash Right Join  (cost=45.79..1147.54 rows=1000 width=16)
              Hash Cond: (inspections.restaurant_id = r.id)
              ->  GroupAggregate  (cost=0.29..1078.29 rows=1000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..918.29 rows=30000 width=12)
              ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
                    ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)
On my machine I get query times around 20ms, so it’s more than twice as slow as the lateral join. We already have an index on inspections covering restaurant_id + inspected_at, so I don’t think there’s an easy way to speed things up here.
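For reference, that index presumably looks something like this. The exact definition isn’t shown in the post, so this is an assumption based on the index name and the index-only scan in the plan:

```sql
-- Assumed definition of the index the query plans refer to:
CREATE INDEX idx_inspections_rest_and_date
ON inspections (restaurant_id, inspected_at);
```

Because the plan’s Index Only Scan covers both columns (width=12), a composite index like this is consistent with what the planner is doing.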
One great feature in Postgres is window functions. They let you compute aggregates (and more!) without turning your whole query into a GROUP BY. I’ve written about window functions a bit here, but it’s a big topic, so if they’re new to you, the Postgres docs are the place to start. With window functions, we can get what we want like this:
SELECT r.id, r.name, i.score, i.inspected_at
FROM restaurants r
LEFT OUTER JOIN (
    SELECT *,
           row_number() OVER (PARTITION BY restaurant_id ORDER BY inspected_at DESC) AS row_number
    FROM inspections) i
ON i.restaurant_id = r.id
AND i.row_number = 1
;
This approach is more correct than the previous one. There are no risky assumptions about unique inspection dates. But it performs a little worse. Here is the query plan:
Merge Right Join  (cost=0.56..4096.38 rows=1000 width=20)
  Merge Cond: (i.restaurant_id = r.id)
  ->  Subquery Scan on i  (cost=0.29..4030.72 rows=150 width=16)
        Filter: (i.row_number = 1)
        ->  WindowAgg  (cost=0.29..3655.72 rows=30000 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)
Results for me take about 25ms. I’ve gotta say I love window functions, but it looks like, in this case at least, a lateral join is a better choice.
There is a special feature in Postgres called DISTINCT ON, which is another way to get grouping-like behavior without a GROUP BY clause. If you’ve heard of DISTINCT but not DISTINCT ON, you should check it out. Basically you name one or more columns whose combination should be unique in the result, then you give an ORDER BY that includes those columns plus (usually) something else, and for each distinct combination, Postgres will include only the first row. That lets us write our query like so:
SELECT r.id, r.name, i.score, i.inspected_at
FROM restaurants r
LEFT OUTER JOIN (
    SELECT DISTINCT ON (restaurant_id) *
    FROM inspections i2
    ORDER BY restaurant_id, inspected_at DESC) i
ON i.restaurant_id = r.id
;
Here is the query plan I get with that query:
Merge Right Join  (cost=0.56..3292.00 rows=1000 width=20)
  Merge Cond: (i2.restaurant_id = r.id)
  ->  Unique  (cost=0.29..3205.72 rows=1000 width=20)
        ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)
And results take about 13ms. That is pretty nice! Not quite as fast as the lateral join, but maybe close enough. It is arguably easier to read, too. DISTINCT ON is a really nice feature that is insufficiently known. It is a Postgres extension to standard SQL, so you won’t find it in MS SQL Server, Oracle, or MySQL, which may or may not matter to you. Personally I’m fine committing to one database vendor and leveraging it to the hilt, especially when that vendor is free, open source, and as mature as Postgres.
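One more nicety: DISTINCT ON also answers the earlier worry about duplicate inspection dates. Adding a tie-breaker column to the ORDER BY makes the chosen row deterministic. This sketch assumes inspections has an id primary key, which the post doesn’t show:

```sql
-- With a tie-breaker, two inspections on the same date no longer
-- produce an arbitrary (or duplicated) result:
SELECT DISTINCT ON (restaurant_id) *
FROM inspections
ORDER BY restaurant_id, inspected_at DESC, id DESC
;
```

Without the tie-breaker, Postgres still returns exactly one row per restaurant_id, but which of the tied rows it picks is unspecified.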
So anyway, I think this is an interesting lineup of alternatives to a lateral join, at least for my specific example. It’s fascinating to me that the lateral join wound up the fastest of all, since that wasn’t why I originally chose it. But if you’re worried about performance, I wouldn’t depend too much on this blog post: test the alternatives against your own real situation, because you might get different results. To me the worthwhile lesson is that these are all options (except maybe top-level GROUP BY).
What do you think? Could I have written any of these queries in a more performant way? Is there yet another approach that would work?
Btw, if you want a real hard-core example of a lateral join that is hard to write any other way, here is an example used in funnel analysis.