With knowledge of the state of the art in 2023: the real problem with Itanium is that it can’t compete with out-of-order (OOO) processors. I’ll give an example:
I wrote a bit of code that takes four FP arguments and runs about 20 instructions, with a total latency of 40 cycles, on an Apple M1. FP instructions have 4 cycles latency and you can dispatch 3 per cycle, so 120 ops are possible in those 40 cycles. All of this ran in a loop. The throughput was about one iteration every 8 cycles, 5 times faster than the latency would suggest, because of OOO processing. So without any particular cleverness in the compiler I got 20 flops in 8 cycles out of a theoretical 24.
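As a sanity check on those numbers, here is a small Python sketch of the arithmetic. The constants are the figures quoted above; the variable names are mine, and nothing here measures real hardware.

```python
# Figures quoted in the text; this script measures nothing itself.
LATENCY_CYCLES = 40        # latency of one iteration on its own
OPS_PER_ITERATION = 20     # FP instructions per iteration
DISPATCH_WIDTH = 3         # FP ops dispatched per cycle
CYCLES_PER_ITERATION = 8   # observed throughput inside the loop

# Upper bound on ops issued during one iteration's full latency.
peak_ops_in_latency = DISPATCH_WIDTH * LATENCY_CYCLES
# Upper bound on ops issued in the 8 cycles one iteration costs in the loop.
peak_ops_per_iteration = DISPATCH_WIDTH * CYCLES_PER_ITERATION
# How much faster the loop runs than back-to-back, non-overlapped iterations.
speedup = LATENCY_CYCLES / CYCLES_PER_ITERATION
# Fraction of the dispatch bandwidth actually used.
utilization = OPS_PER_ITERATION / peak_ops_per_iteration
# Iterations that must overlap to hide the latency.
iterations_in_flight = LATENCY_CYCLES // CYCLES_PER_ITERATION

print(f"peak ops in {LATENCY_CYCLES} cycles: {peak_ops_in_latency}")
print(f"speedup from overlap: {speedup:.1f}x")
print(f"utilization: {OPS_PER_ITERATION}/{peak_ops_per_iteration} = {utilization:.0%}")
print(f"iterations in flight: {iterations_in_flight}")
```

The last figure is the important one: about five iterations have to be in flight at once to reach that throughput.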
On Itanium, doing that in a loop would take superhuman effort. Each pass through the loop body would have to perform operations from five different loop iterations to cover the latency. That’s tough. You start with one op from iteration 1. Then the first op of iteration 2 and the second of iteration 1. Then 3 ops from 3 iterations, 4 ops from 4 iterations, 5 ops from 5 iterations, and from then on you repeat that steady state. But you also have to handle the case where there are only 1, 2, 3 or 4 iterations. And if there is a branch in your loop, it gets worse. Now consider that there is no loop: you have to find other ops to fill the latency. Now consider that your function returns: there is nothing you can do on Itanium, and you are going to waste a lot of time. That Apple ARM processor will just do it for you. It can gather several hundred operations and execute each one when it is ready.
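The ramp-up described above can be sketched as a toy schedule. This is only an illustration of the idea, assuming each iteration is split into five equal stages and a new iteration starts every step; `pipeline_schedule` is a name made up for this example, and it does not model any real compiler or Itanium's actual mechanisms.

```python
def pipeline_schedule(n_iterations, n_stages=5):
    """Return, for each step, the (iteration, stage) pairs issued in that step.

    Iterations and stages are numbered from 1. Iteration i starts at step i,
    so stage s of iteration i is issued at step i + s - 1.
    """
    steps = []
    for t in range(n_iterations + n_stages - 1):
        ops = [(i + 1, t - i + 1)
               for i in range(n_iterations)
               if 0 <= t - i < n_stages]
        steps.append(ops)
    return steps


# Long loop: ramp up, run the steady state, then drain.
print([len(ops) for ops in pipeline_schedule(8)])
# [1, 2, 3, 4, 5, 5, 5, 5, 4, 3, 2, 1]

# Short loop: the steady state of 5 overlapping iterations is never reached.
print([len(ops) for ops in pipeline_schedule(3)])
# [1, 2, 3, 3, 3, 2, 1]
```

With a static schedule, the ramp-up, the steady state, the drain and every short trip count all have to be handled by the generated code. An OOO core finds the same overlap at run time.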
Now that was just one feature. Itanium had several features of similar complexity. It just cannot keep up with a modern OOO processor, no matter how good the compiler is.
What I do wonder is how much of the Itanium design could be used today. For example, OOO execution on the Apple M1 is not perfect. If the number of available operations is close to the limit set by latency, then you get delays, so having the compiler handle two iterations of a loop at once could be beneficial. Instruction words of 32 bits might not be optimal. Maybe 128-bit words would be better: three slots, each holding either one 42-bit instruction or two 21-bit instructions, plus two bits for hints. Maybe memory prefetch instructions would help. All of this is fine as long as code doesn’t have to be compiled for one specific model of the CPU.
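To make the 128-bit idea concrete, here is a rough Python sketch of such a word. The bit order (hint bits lowest, then slot 0, 1, 2) and all the function names are assumptions made up for this example. It also leaves out how a decoder would tell a 42-bit instruction from a pair of 21-bit ones.

```python
SLOT_BITS = 42
HALF_SLOT_BITS = 21
HINT_BITS = 2
SLOTS = 3
WORD_BITS = SLOTS * SLOT_BITS + HINT_BITS  # 3 * 42 + 2 = 128


def pair_21(first, second):
    """Combine two 21-bit instructions into one 42-bit slot (hypothetical layout)."""
    for half in (first, second):
        if not 0 <= half < (1 << HALF_SLOT_BITS):
            raise ValueError("instruction does not fit in 21 bits")
    return first | (second << HALF_SLOT_BITS)


def pack_word(slots, hints):
    """Pack three 42-bit slots and two hint bits into one 128-bit integer."""
    if len(slots) != SLOTS:
        raise ValueError("expected exactly three slots")
    if not 0 <= hints < (1 << HINT_BITS):
        raise ValueError("hints do not fit in 2 bits")
    word = hints
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 42 bits")
        word |= slot << (HINT_BITS + index * SLOT_BITS)
    return word


def unpack_word(word):
    """Split a 128-bit integer back into (slots, hints)."""
    hints = word & ((1 << HINT_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (HINT_BITS + i * SLOT_BITS)) & mask for i in range(SLOTS)]
    return slots, hints


# One full-width instruction, one pair of short ones, one more full-width one.
word = pack_word([0x123456789, pair_21(0x1ABCD, 0x0FFFF), 0x3FFFFFFFFFF], hints=0b10)
print(WORD_BITS, unpack_word(word))
```

The point is only that 3 × 42 + 2 comes to exactly 128 bits, so the word wastes no space.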