Revisions to Determining the least dangerous tags with over 1K answers to answer

added 202 characters in body

Source Link

edited Apr 30, 2024 at 12:37

anon

For clarity, in your question you say "answers" but your query will only return up and down votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the tags with at least 1,000 questions. In theory, voting patterns on questions vs. answers in any given tag could be opposite.

And finally, my version almost times out on Stack Overflow too (I got it to run in 119 seconds, just beating the timeout). This is because the threshold for tags to include is too low; there are almost 6,000 tags that meet the 1,000-post criteria, there are 72 million rows in PostTags in Stack Overflow vs. 4 million in Math, and the Votes table is almost 20 times larger (239 million rows vs. 12 million). These lead to cascading effects throughout the plan and drastically affect how much data SEDE has to read to get the results.

But in Stack Overflow, that takes more than 5 seconds on its own, again due to volume. There are 72 million rows in PostTags in Stack Overflow, vs. 4 million in Math. I tried it anyway but using 500 instead of 1,000; it returned in 37 seconds.

For clarity, in your question you say "answers" but your query will only return up and down votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the tags with at least 1,000 questions.

And finally, my version almost times out on Stack Overflow too (I got it to run in 119 seconds, just beating the timeout). This is because the threshold for tags to include is too low; there are almost 6,000 tags that meet the 1,000-post criteria, and the Votes table is almost 20 times larger (239 million rows vs. 12 million).

But in Stack Overflow, that takes more than 5 seconds on its own, again due to volume. There are 72 million rows in PostTags in Stack Overflow, vs. 4 million in Math. I tried it anyway but using 500 instead of 1,000; it returned in 37 seconds.

For clarity, in your question you say "answers" but your query will only return votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the tags with at least 1,000 questions. In theory, voting patterns on questions vs. answers in any given tag could be opposite.

And finally, my version almost times out on Stack Overflow too (I got it to run in 119 seconds, just beating the timeout). This is because the threshold for tags to include is too low; there are almost 6,000 tags that meet the 1,000-post criteria, there are 72 million rows in PostTags in Stack Overflow vs. 4 million in Math, and the Votes table is almost 20 times larger (239 million rows vs. 12 million). These lead to cascading effects throughout the plan and drastically affect how much data SEDE has to read to get the results.

But in Stack Overflow, that takes more than 5 seconds on its own, again due to volume. I tried it anyway but using 500 instead of 1,000; it returned in 37 seconds.

added 1929 characters in body

Source Link

edited Apr 30, 2024 at 12:27

anon

For clarity, in your question you say "answers" but your query will only return up and down votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the tags with at least 1,000 questions.

But I don't think you should change the question to add that complication; it won't really change any of the prescriptive guidance anyway. I just wanted to be sure you were aware that, in logical terms, your query tells you the safest tags to post questions in, not to post answers in.

And finally, my version almost times out on Stack Overflow too (I got it to run in 119 seconds, just beating the timeout). This is because the threshold for tags to include is too low; there are almost 6,000 tags that meet the 1,000-post criteria, and the Votes table is almost 20 times larger (239 million rows vs. 12 million).

If I increase the threshold to 10,000, it returns 845 rows in 51 seconds.

SEDE query link

You might consider a different approach that would be a little more linear at least in terms of volume of questions on any given site, which is not "all the tags with at least 1,000 questions" but rather "give me the 1,000 tags with the most questions", e.g.:

SELECT TOP (1000) TagId, [count] = COUNT(*)
  FROM PostTags
 GROUP BY TagId
 ORDER BY [count] DESC;

But in Stack Overflow, that takes more than 5 seconds on its own, again due to volume. There are 72 million rows in PostTags in Stack Overflow, vs. 4 million in Math. I tried it anyway but using 500 instead of 1,000; it returned in 37 seconds.

SEDE query link

For clarity, in your question you say "answers" but your query will only return up and down votes on questions since tags are only directly associated with questions. You'd have to join to posts again first to get the votes for the answers associated with questions in the tags with at least 1,000 questions.

But I don't think you should change the question to add that complication; it won't really change any of the prescriptive guidance anyway. I just wanted to be sure you were aware that, in logical terms, your query tells you the safest tags to post questions in, not to post answers in.

And finally, my version almost times out on Stack Overflow too (I got it to run in 119 seconds, just beating the timeout). This is because the threshold for tags to include is too low; there are almost 6,000 tags that meet the 1,000-post criteria, and the Votes table is almost 20 times larger (239 million rows vs. 12 million).

If I increase the threshold to 10,000, it returns 845 rows in 51 seconds.

SEDE query link

You might consider a different approach that would be a little more linear at least in terms of volume of questions on any given site, which is not "all the tags with at least 1,000 questions" but rather "give me the 1,000 tags with the most questions", e.g.:

SELECT TOP (1000) TagId, [count] = COUNT(*)
  FROM PostTags
 GROUP BY TagId
 ORDER BY [count] DESC;

But in Stack Overflow, that takes more than 5 seconds on its own, again due to volume. There are 72 million rows in PostTags in Stack Overflow, vs. 4 million in Math. I tried it anyway but using 500 instead of 1,000; it returned in 37 seconds.

SEDE query link

Source Link

answered Apr 30, 2024 at 12:01

anon

Well, that it returns in 10 seconds is already pretty good, considering you:

need to count all the up and down votes for every post across almost 2,000 tags
scan the entire Votes table twice, leading to reading almost 25 million rows (due to 8x parallelism)
sort five times (up to 10 million rows)
include tags in the aggregate scan just to get the name (you should defer getting names until after aggregation)
won't get help from indexes tailored specifically for this query
won't get any noticeable benefit from statistics

How much faster than 10 seconds does it need to be? How fast do you expect this type of aggregate query to be?

Here is the complex plan for your existing query, with lots of thick lines and with the pain points highlighted (right-click to enlarge in a new tab):

Well, it can be a lot faster, relatively, but if the complexity of the query makes maintenance or understanding harder, it's probably not worth saving a few seconds of runtime.

I can't benefit any more from indexes or stats than you can (well, I could if I changed the database), and I can't change the fact that reading all of the votes for all of the posts across 2,000 tags is never going to be cheap.

But we can restructure the query so that Votes is only scanned once, the expensive sorts are removed, and the runtime is about 2 seconds. I don't like that we still have to scan PostTags twice, and it can probably be avoided with window functions, but I think an 80% improvement is a pretty good start.

;WITH BusyTags AS /* isolate the tag aggregate */
(
  SELECT TagId
    FROM PostTags
   GROUP BY TagId
  HAVING COUNT(*) >= 1000
), t AS /* then get the names */
(
  SELECT bt.TagId, t.TagName
    FROM BusyTags AS bt
   INNER JOIN Tags AS t
      ON bt.TagId = t.Id
), pt AS /* now get the posts */
(
  SELECT pt.PostId, t.TagId, t.TagName
    FROM PostTags AS pt
   INNER JOIN t
      ON pt.TagId = t.TagId
), v AS /* then sum the votes with a single scan */
(
  SELECT pt.TagId, Tags = pt.TagName, 
         UpVotes   = SUM(CASE WHEN v.VoteTypeId = 2 THEN 1 ELSE 0 END), 
         DownVotes = SUM(CASE WHEN v.VoteTypeId = 3 THEN 1 ELSE 0 END)
    FROM pt
   INNER JOIN Votes AS v
      ON pt.PostId = v.PostId
   GROUP BY pt.TagId, pt.TagName
)
SELECT Tags, 
       ratio = CONVERT(DECIMAL(5,2), 
               1.0 * Upvotes/NULLIF(DownVotes,0))
  FROM v
 ORDER BY ratio DESC;

Here is the simpler plan, with fewer (largely unavoidable) pain points highlighted (right-click to enlarge in a new tab):

SEDE query link

_{To see real runtime instead of benefitting from cached results, make a meaningless text change to the query and/or enable Include execution plan.}

Stack Exchange Network

Return to Answer