Determining the least dangerous tags with over 1K answers to answer

Question

So I thought I'd take another shot at making a SEDE query after being somewhat inactive after a while.

Essentially what this does is that it takes the tags that have at least 1000 answers for that tag and ranks them based on the ratio of upvotes to downvotes (you won't get downvoted as often for participating in that tag).

SEDE query link

Code:

WITH q AS
        (SELECT  t.*,
             -- Count upvotes on each tag
                (SELECT COUNT(*) AS cnt
                FROM PostTags pt
                JOIN Posts p ON p.Id = pt.Postid
                INNER JOIN Votes v ON v.PostId = p.Id
             -- Faster than using
             -- INNER JOIN Posts pa ON pa.ParentId = p.Id
             -- INNER JOIN Votes v ON v.PostId = pa.Id
                WHERE pt.Tagid = t.Id
                AND v.VoteTypeId = 2) AS upvotes,
             -- Now counting downvotes
                (SELECT COUNT(*) AS cnt
                FROM PostTags pt
                JOIN Posts p ON p.Id = pt.PostId
                INNER JOIN Votes v ON v.PostId = p.Id
                WHERE pt.TagId = t.Id
                AND v.VoteTypeId = 3) AS downvotes
        FROM Tags t
        CROSS APPLY
                (SELECT COUNT(*) AS cnt
                FROM PostTags pt
                WHERE pt.tagid = t.id
                HAVING COUNT(*) >= 1000
             -- Check if the tag has 1K+ answers
                ) pt
        )
SELECT tagname AS [Tags],
ROUND(1.0 * upvotes / NULLIF(downvotes, 0), 2) AS [ratio]
FROM q
ORDER BY [ratio] DESC

My question: Can this be optimized runtime-wise, or is this as optimized as it can get? It seems kinda weird to me that it returns a runtime error when I try to run this on Stack Overflow, but will give an output on Math.SE in only 8-9 seconds.

The gain may be minor, but it is good practice to avoid broad SELECT * statements. Don't fetch more data that you really need. Accordingly, a SELECT t.* is not needed to achieve the intended result. PS: the execution plan gives a few clues as well. — Kate
– Kate, Commented Apr 30, 2024 at 1:06
@Kate Could you give me another hint? I don't exactly know where in the execution plan I would need to look for the clues. (although replacing SELECT t.* with SELECT * seems to do the exact same thing in the same amount of time, so there's that.) — CrSb0001
– CrSb0001, Commented Apr 30, 2024 at 1:12
In your question you say "answers" but I don't think your query isolates answers - in fact your query will only return up and down votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the top 1000 tags. — anon
– anon, Commented Apr 30, 2024 at 12:08

score 4 · Accepted Answer · 2024-04-30 12:37:56Z

Well, that it returns in 10 seconds is already pretty good, considering you:

need to count all the up and down votes for every post across almost 2,000 tags
scan the entire Votes table twice, leading to reading almost 25 million rows (due to 8x parallelism)
sort five times (up to 10 million rows)
include tags in the aggregate scan just to get the name (you should defer getting names until after aggregation)
won't get help from indexes tailored specifically for this query
won't get any noticeable benefit from statistics

How much faster than 10 seconds does it need to be? How fast do you expect this type of aggregate query to be?

Here is the complex plan for your existing query, with lots of thick lines and with the pain points highlighted (right-click to enlarge in a new tab):

Well, it can be a lot faster, relatively, but if the complexity of the query makes maintenance or understanding harder, it's probably not worth saving a few seconds of runtime.

I can't benefit any more from indexes or stats than you can (well, I could if I changed the database), and I can't change the fact that reading all of the votes for all of the posts across 2,000 tags is never going to be cheap.

But we can restructure the query so that Votes is only scanned once, the expensive sorts are removed, and the runtime is about 2 seconds. I don't like that we still have to scan PostTags twice, and it can probably be avoided with window functions, but I think an 80% improvement is a pretty good start.

;WITH BusyTags AS /* isolate the tag aggregate */
(
  SELECT TagId
    FROM PostTags
   GROUP BY TagId
  HAVING COUNT(*) >= 1000
), t AS /* then get the names */
(
  SELECT bt.TagId, t.TagName
    FROM BusyTags AS bt
   INNER JOIN Tags AS t
      ON bt.TagId = t.Id
), pt AS /* now get the posts */
(
  SELECT pt.PostId, t.TagId, t.TagName
    FROM PostTags AS pt
   INNER JOIN t
      ON pt.TagId = t.TagId
), v AS /* then sum the votes with a single scan */
(
  SELECT pt.TagId, Tags = pt.TagName, 
         UpVotes   = SUM(CASE WHEN v.VoteTypeId = 2 THEN 1 ELSE 0 END), 
         DownVotes = SUM(CASE WHEN v.VoteTypeId = 3 THEN 1 ELSE 0 END)
    FROM pt
   INNER JOIN Votes AS v
      ON pt.PostId = v.PostId
   GROUP BY pt.TagId, pt.TagName
)
SELECT Tags, 
       ratio = CONVERT(DECIMAL(5,2), 
               1.0 * Upvotes/NULLIF(DownVotes,0))
  FROM v
 ORDER BY ratio DESC;

Here is the simpler plan, with fewer (largely unavoidable) pain points highlighted (right-click to enlarge in a new tab):

SEDE query link

_{To see real runtime instead of benefitting from cached results, make a meaningless text change to the query and/or enable Include execution plan.}

For clarity, in your question you say "answers" but your query will only return votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the tags with at least 1,000 questions. In theory, voting patterns on questions vs. answers in any given tag could be opposite.

But I don't think you should change the question to add that complication; it won't really change any of the prescriptive guidance anyway. I just wanted to be sure you were aware that, in logical terms, your query tells you the safest tags to post questions in, not to post answers in.

And finally, my version almost times out on Stack Overflow too (I got it to run in 119 seconds, just beating the timeout). This is because the threshold for tags to include is too low; there are almost 6,000 tags that meet the 1,000-post criteria, there are 72 million rows in PostTags in Stack Overflow vs. 4 million in Math, and the Votes table is almost 20 times larger (239 million rows vs. 12 million). These lead to cascading effects throughout the plan and drastically affect how much data SEDE has to read to get the results.

If I increase the threshold to 10,000, it returns 845 rows in 51 seconds.

SEDE query link

You might consider a different approach that would be a little more linear at least in terms of volume of questions on any given site, which is not "all the tags with at least 1,000 questions" but rather "give me the 1,000 tags with the most questions", e.g.:

SELECT TOP (1000) TagId, [count] = COUNT(*)
  FROM PostTags
 GROUP BY TagId
 ORDER BY [count] DESC;

But in Stack Overflow, that takes more than 5 seconds on its own, again due to volume. I tried it anyway but using 500 instead of 1,000; it returned in 37 seconds.

SEDE query link

You sure that it returns on SO in 119 seconds when testing? When I ran it on SO it gave me all 5469 rows that met the requirement in only ~90.8 seconds. — CrSb0001
– CrSb0001, Commented Apr 30, 2024 at 13:45
@CrSb0001 Yes, I’m pretty sure I read the duration correctly when I ran it the first time. But if my query ran in 119 seconds, you running it again might benefit slightly because now the data is all in memory. Tomorrow, or next week, it may be very different. It may vary from execution to execution even in a short time and disregarding memory, too, depending on how much other activity is going on in SEDE at any given time. It's still a server with a given set of resources that are shared and finite, so accurately predicting runtime of any given query every time it runs in the future is hard. — anon
– anon, Commented Apr 30, 2024 at 13:50
...particularly on the site with the most data - by at least an order of magnitude - and the highest likelihood for contention. :-) Curious what you would really do with data about 5,469 tags on Stack Overflow ordered by "safety"... that seems like a lot to process manually if you're using it to determine where you should focus your engagement. — anon
– anon, Commented Apr 30, 2024 at 13:58
@AaronBertand I guess my main intention was finding "safe" tags to participate in on smaller sites such as Puzzling and Math.SE. The main reason I was testing on SO was because on my previous SEDE question, J_H (the person who answered) mentioned that it would be interesting to publish some "hard" instances of the query (i.e. testing on SO). Either that or there was a comment on my question saying that I should do that to test the query's capabilities, although I'm not really sure since the comments have all been deleted. — CrSb0001
– CrSb0001, Commented Apr 30, 2024 at 14:09
@CrSb0001 There are only 16 tags on puzzling that even meet your 1,000-post criteria, but the skew in "safety" is pretty significant. — anon
– anon, Commented Apr 30, 2024 at 14:16

Stack Exchange Network

Determining the least dangerous tags with over 1K answers to answer

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Determining the least dangerous tags with over 1K answers to answer

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions