Postgres CTE for Threaded Comments

2014-09-17

Not long ago I answered a question on the Postgres mailing list I thought was pretty fun: How do you construct a recursive CTE to pull a whole tree of threaded comments, sorting sibling comments by votes?

Threaded, scored comments are what you see on sites like Reddit and Hacker News:

reddit comments

hacker news comments

Threaded means the comments appear in a tree-like structure. Contrast this with Wordpress or Discourse, where every comment appears at the end:

discourse comments

Disqus (which uses threaded comments) has a nice article about implementing them in Postgres with a recursive CTE. They start with this data:

CREATE TABLE comments (
  id SERIAL PRIMARY KEY,
  message VARCHAR,
  author VARCHAR,
  parent_id INTEGER REFERENCES comments(id)
);
INSERT INTO comments (message, author, parent_id)
  VALUES
  ('This thread is really cool!', 'David', NULL),
  ('Ya David, we love it!', 'Jason', 1),
  ('I agree David!', 'Daniel', 1),
  ('gift Jason', 'Anton', 2),
  ('Very interesting post!', 'thedz', NULL),
  ('You sir, are wrong', 'Chris', 5),
  ('Agreed', 'G', 5),
  ('Fo sho, Yall', 'Mac', 5);

Then they recommend using this query:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth)  AS (
    SELECT  id,
        message,
        author,
        array[id] AS path,
        parent_id,
        1 AS depth
    FROM    comments
    WHERE   parent_id IS NULL

    UNION ALL

    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth FROM cte
ORDER BY path;

Essentially what’s happening here is that as we recursively execute the CTE, we build up a “path” to each comment, with the IDs of all ancestors. Then we can sort by the path, so that comments with the same ancestors sort together. This technique relies on the fact that the array A will sort before all other arrays that begin with A, so parents will appear before their children.

(Note that the Disqus solution returns the depth of each comment. This is not really necessary, since it is just the length of path, but I’ve left it in the query because computing depth is often necessary with recursive CTEs, so it might be interesting if you haven’t seen it before.)

That gets you the comments sorted as a tree, but the question on the mailing list was how to also sort by votes, so that while preserving the tree structure, comments with higher votes would appear before their siblings.

To figure this out, let’s modify the Disqus sample data so each comment has a score:

CREATE TABLE comments (
  id SERIAL PRIMARY KEY,
  message VARCHAR,
  author VARCHAR,
  parent_id INTEGER REFERENCES comments(id),
  votes INTEGER
);
INSERT INTO comments (message, author, parent_id, votes)
  VALUES
  ('This thread is really cool!', 'David', NULL, 1),
  ('Ya David, we love it!', 'Jason', 1, 3),
  ('I agree David!', 'Daniel', 1, 4),
  ('gift Jason', 'Anton', 2, 15),
  ('Very interesting post!', 'thedz', NULL, 3),
  ('You sir, are wrong', 'Chris', 5, 3),
  ('Agreed', 'G', 5, 5),
  ('Fo sho, Yall', 'Mac', 5, 12);

One intuitive approach is to change path to include only ancestors, and then resolve ties by looking at votes:

. . .
ORDER BY path, votes DESC

The problem is that this approach fails to keep children with their parents. We need to keep the full chain of IDs.

The correct answer is to sort by path, like in the Disqus article, but at each step also sort by votes. “At each step” is the key. Conceptually what we want is for each step along the path to sort as if by votes DESC, id ASC. To make that happen, rather than constructing the path soley of IDs, we can build it out of (votes, id) tuples. Then we still get our tree, but siblings will differ in the second-to-last array element.

One wrinkle is that we want to sort by votes in reverse order. But it’s easy to just negate the votes to achieve this. Here is the SQL:

WITH RECURSIVE cte (id, message, author, path, parent_id, depth, votes)  AS (
    SELECT  id,
        message,
        author,
        array[-votes,id] AS path,
        parent_id,
        1 AS depth,
        votes
    FROM    comments
    WHERE   parent_id IS NULL
    UNION ALL
    SELECT  comments.id,
        comments.message,
        comments.author,
        cte.path || -comments.votes || comments.id,
        comments.parent_id,
        cte.depth + 1 AS depth,
        comments.votes
    FROM    comments
    JOIN cte ON comments.parent_id = cte.id
    )
    SELECT id, message, author, path, depth, votes FROM cte
ORDER BY path;

Running it get us this:

 id |           message           | author |       path        | depth | votes 
----+-----------------------------+--------+-------------------+-------+-------
  5 | Very interesting post!      | thedz  | {-3,5}            |     1 |     3
  8 | Fo sho, Yall                | Mac    | {-3,5,-12,8}      |     2 |    12
  7 | Agreed                      | G      | {-3,5,-5,7}       |     2 |     5
  6 | You sir, are wrong          | Chris  | {-3,5,-3,6}       |     2 |     3
  1 | This thread is really cool! | David  | {-1,1}            |     1 |     1
  3 | I agree David!              | Daniel | {-1,1,-4,3}       |     2 |     4
  2 | Ya David, we love it!       | Jason  | {-1,1,-3,2}       |     2 |     3
  4 | gift Jason                  | Anton  | {-1,1,-3,2,-15,4} |     3 |    15

You can see that high-scoring comments are sorted higher than their siblings! I hope you enjoyed this walk through recursive CTEs for threaded, scored comments. I thought it was a pretty fun problem!

blog comments powered by Disqus Prev: Rails acts_as_list with Soft Delete Next: Flash movie (.swf) won't load in Rails 4