Uh oh. You're right. I copied the code from one of my test runs instead of the final code. I will edit shortly.
P.S. I was completely off. My statement regarding multi-threading was incorrect. Perks of working late and not organizing my files (I was benchmarking the wrong executable). I have posted an updated benchmark that shows that my optimized C++ is not really ahead of yours; their performance is virtually the same. I am on Alder Lake, by the way.