Whenever I have to write an aggregate SQL query over time-series data, I start with generate_series. Mostly that’s to avoid missing rows in the output. For instance, if I’m writing a query to show total sales in every hour, I don’t want to write this:
SELECT extract(hour FROM order_placed_at) AS h,
       SUM(total_price)
FROM sales
WHERE order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h
because then if there was an hour with no sales, that row will simply be missing from the result.
Instead I use generate_series to generate all the hours, and join my real table to that:
SELECT h,
       COALESCE(SUM(total_price), 0) AS revenue_per_hour
FROM generate_series(0, 23) s(h)
LEFT OUTER JOIN sales
ON extract(hour FROM order_placed_at) = h
AND order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h
That way I get 24 rows every time. Note that the date filter goes into the join condition, not a WHERE clause: the empty hours have a NULL order_placed_at, so a WHERE filter on that column would discard those rows and quietly turn the outer join back into an inner join.
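If generate_series is new to you, it’s just a set-returning function that produces one row per value, so the 24 hours come straight from:

```sql
-- One row for each integer from 0 through 23
SELECT * FROM generate_series(0, 23) s(h);
-- 24 rows: 0, 1, 2, …, 23
```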
But another interesting use case for generate_series appeared on the Postgres mailing list: suppose you have a table of events with start_time and end_time, and you want to report how many events were “ongoing” during each hour of the day. That means one row of data can be counted in multiple output rows. Without generate_series this is hard to write, but if you use that as your “core” table and then join your data table to it, it just works:
SELECT h,
       COUNT(events.id) AS events_per_hour,
       array_remove(array_agg(events.id), NULL) AS active_events
FROM generate_series(0, 23) s(h)
LEFT OUTER JOIN events
ON h BETWEEN extract(hour FROM events.start_time) AND extract(hour FROM events.end_time)
GROUP BY h
ORDER BY h
That gives you one output row per hour, and an event is counted once for each hour it spans. I’ve thrown in an array_agg so you can see the IDs of each event, just to double-check my work.
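The array_remove is there because array_agg picks up a NULL for every empty hour; stripping those out is a one-liner. For example:

```sql
-- array_remove drops every element equal to its second argument
SELECT array_remove(ARRAY[7, NULL, 9], NULL);
-- → {7,9}
```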
But there are still problems. What if our query is supposed to cover multiple days? How will it handle events that cross midnight? What about events that start on Monday and run until Wednesday? For these kinds of things, we should avoid extract in our BETWEEN, and instead of generating a series of integers, we should generate a series of timestamps. This is also a great chance to use Postgres’s new(ish) range types, to avoid writing our own bug-prone interval comparisons:
SELECT extract(hour FROM s.h) AS hour,
       COUNT(DISTINCT events.id) AS events_per_hour,
       array_remove(array_agg(DISTINCT events.id), NULL) AS active_events
FROM generate_series(
       date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
       date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
       interval '1 hour') s(h)
LEFT OUTER JOIN events
ON tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
Here we are using the && (overlaps) operator to check whether each event overlaps each window. If we didn’t want to use ranges, we could also use the SQL OVERLAPS operator, which accomplishes the same thing.
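For instance, the join condition could be rewritten without ranges like this (a sketch of the equivalent check, not part of the original query; OVERLAPS has its own fine print around zero-length periods):

```sql
-- SQL-standard OVERLAPS: true when the two periods share any point in time
LEFT OUTER JOIN events
ON (events.start_time, events.end_time) OVERLAPS (s.h, s.h + interval '1 hour')
```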
Now suppose we also want to report the total time each event was “active” during each window. That is just a SUM() we can add to our query, using the * (intersection) range operator. But be careful about NULL values coming from our LEFT JOIN! In ranges, a NULL at either end means unbounded, so tsrange(NULL, NULL) is not NULL, but all of time, from negative infinity to positive infinity! Here is a query that works around that:
SELECT extract(hour FROM s.h) AS hour,
       COUNT(DISTINCT events.id) AS events_per_hour,
       array_remove(array_agg(DISTINCT events.id), NULL) AS active_events,
       SUM(CASE WHEN events.id IS NULL
                THEN NULL
                ELSE date_trunc('minute', range_to_interval(
                       tsrange(s.h, s.h + interval '1 hour') *
                       tsrange(events.start_time, events.end_time)))
           END) AS active_hours
FROM generate_series(
       date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
       date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
       interval '1 hour') s(h)
LEFT OUTER JOIN events
ON tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
;
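You can check the unbounded-range behavior that makes the CASE necessary directly:

```sql
SELECT tsrange(NULL, NULL) IS NULL;                           -- false: it is a real range
SELECT tsrange(NULL, NULL) @> '2015-03-15 12:00'::timestamp;  -- true: it contains every timestamp
```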
Oh, one more thing… There is no such thing as range_to_interval. A surprising omission! But here is a function that defines it:
CREATE OR REPLACE FUNCTION range_to_interval(tsrange)
RETURNS interval
AS
$$
SELECT upper($1) - lower($1)
$$
LANGUAGE sql
STABLE;
It might be even more proper to create a user-defined cast here, but I’ll leave that as an exercise for the reader. :-)
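For the impatient, a minimal sketch of that cast (assuming the range_to_interval function above is already defined):

```sql
-- Registers an explicit cast, so you can write some_tsrange::interval
CREATE CAST (tsrange AS interval)
WITH FUNCTION range_to_interval(tsrange);
```

After that, tsrange(start_time, end_time)::interval does the same job as the function call.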