Someone posted a question recently to the Postgres mailing list that makes for a great puzzle. He had a table (we’ll call it m
) like this:
d | v
---------------+---------------
geography | north
geography | south
industry type | retail
industry type | manufacturing
industry type | wholesale
(5 rows)
Basically this table was a list of “dimensions” and the possible values for each dimension. So there is a “geography” dimension with possible values “north” and “south”, and an “industry type” dimension with values “retail”, “manufacturing”, and “wholesale”. The table could hold more than two dimensions or have more possible values for each dimension.
His goal was to find a query that would give him all possible combinations, assuming that each combination has exactly one value along each dimension. This is roughly the idea behind a cartesian product, so it seems like the thing you’d use a CROSS JOIN for, but here the data is all in one table, with n possible partitions. So how to write a query that works regardless of how many unique dimensions the table holds?
If you want to stop reading here and go work on the puzzle yourself, I wouldn’t blame you. And I’d love to hear what people come up with. I think it’s a really hard puzzle. If you want to see my solution, keep reading.
You can recreate the table with these commands:
CREATE TABLE m (
d VARCHAR(255) NOT NULL,
v VARCHAR(255) NOT NULL,
PRIMARY KEY (d, v)
);
INSERT INTO m
(d, v)
VALUES
('geography', 'north'),
('geography', 'south'),
('industry type', 'retail'),
('industry type', 'manufacturing'),
('industry type', 'wholesale');
My solution may not be the only one, but I’m proud of doing it in pure SQL (no plpgsql), and I think it’s a neat demonstration of several special features in Postgres. Here is the query:
WITH RECURSIVE t(combo, n) AS (
WITH dims AS (SELECT DISTINCT d, row_number() OVER () AS n FROM m GROUP BY d)
SELECT '{}'::text[], 1
UNION ALL
SELECT array_append(t2.combo::text[], m.v::text), t2.n+1
FROM t t2, dims
CROSS JOIN m
WHERE m.d = dims.d AND dims.n = t2.n
)
SELECT combo
FROM t
WHERE n = (SELECT COUNT(DISTINCT d) + 1 FROM m);
The first thing to note is that we’re using a WITH
expression. This is called a Common Table Expression (CTE), and it’s basically a way to create a sub-query and give it a name. In our case the “table” is named t
, with columns combo
and n
.
But this CTE is special, because it uses the RECURSIVE keyword. With a recursive CTE, your definition is actually two queries, separated by UNION ALL
. The query on top is the starting condition, in our case just a single row. The query on bottom is executed repeatedly until it returns no more rows, and it is allowed to access the results of the previous execution. The full results of the CTE are the starting condition plus all the rows ever returned by the recursive part.
To work through our CTE, let’s look more closely at the starting condition. Here it is:
SELECT '{}'::text[], 1
That first column is a special Postgres feature called an array. An array is pretty much what you’d expect, although unlike in most programming languages they are 1-indexed. They have similar syntax for accessing their elements (although we don’t use it here):
some_array[4]
An array literal is actually a string cast to an array. Here is a literal for the first three positive integers:
'{1, 2, 3}'::int[]
So you can see that our CTE’s initial SELECT contains an empty text array.
The idea is for each array to represent one combination, so we might have {wholesale,north}
or {retail,south}
. We will grow the array one dimension at a time until all the dimensions are included. We are using arrays because they can have as many elements as we want, whereas a SQL query must have a fixed number of columns.
The second half of our CTE is how we grow the array. Here is the code:
SELECT array_append(t2.combo::text[], m.v::text), t2.n+1
FROM t t2, dims
CROSS JOIN m
WHERE m.d = dims.d AND dims.n = t2.n
So t
(aliased to t2
) is whatever the previous iteration produced. For each row in t
, we want to create several new rows, one for each possible value of the next dimension, tacking those values onto the end of the array. In other words, if t
has the row {wholesale}
, we want to produce the new rows {wholesale,north}
and {wholesale,south}
. The CROSS JOIN
and array_append
accomplish this for us.
We also need some way to stop the recursion. Postgres will quit re-executing our query if it ever produces zero rows. To make sure this happens, we include a second “counter” column, which also serves as a way to grab a different dimension on each iteration. This is what our nested CTE named dims
is all about. It’s just a table with one row per dimension, and the rows numbered. Our WHERE clause makes sure that we process a different dimension on each iteration, and stop when there are none left.
The other thing of note is that in our nested CTE (dims
), we are using a fancy Postgres feature called a window function. Our window function is the row_number() OVER ()
part, which, as window functions go, is actually a crashing uninteresting specimen. Rather than turning this post into a window function tutorial, I’ll just say that you should go read about them, because they are wonderful. Basically they let you get values into a non-GROUP-BY query that you’d normally need aggregate functions to compute. They are like a SQL superpower.
Finally, you can see that in the outermost query, we are filtering out the “intermediate” rows from our CTE, so that we only get combinations that include all dimensions.
Here are the results:
combo
{wholesale,north}
{wholesale,south}
{retail,north}
{retail,south}
{manufacturing,north}
{manufacturing,south}
(6 rows)
I hope you enjoyed this puzzle! I’m delighted how it does something impossible in ordinary SQL (return results with a dynamic number of “columns”), and combines three exotic Postgres features you may not have seen before.
blog comments powered by Disqus Prev: Rules for Rails Migrations Next: Read-only rails_admin