Revisions to Why was the Itanium processor difficult to write a compiler for?

Clarify that this is about an Apple M1, rather than ARM Cortex-M1


edit approved Nov 4, 2024 at 7:23


With knowledge of the state of the art in 2023: the real problem with Itanium is that it can’t compete with out-of-order (OOO) processors. I’ll give an example:

I wrote a bit of code that takes four FP arguments and runs about 20 instructions with a total latency of 40 cycles, all in a loop, on an Apple M1. FP instructions have a latency of 4 cycles and you can dispatch 3 per cycle, so 120 ops could be issued in those 40 cycles. The measured throughput was about one iteration every 8 cycles, 5 times faster than the 40-cycle latency alone would allow, because of OOO processing. So without any particular cleverness in the compiler I got 20 flops in 8 cycles out of a theoretical 24.
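The arithmetic behind those figures can be checked with a short sketch. The numbers are the ones quoted above; the function and key names are made up for illustration and do not come from any measurement tool.

```python
def throughput_summary(instructions=20, chain_latency=40,
                       dispatch_width=3, cycles_per_iteration=8):
    """Derive the figures quoted in the answer from its four input numbers.

    All parameter names are hypothetical; the defaults are the values
    stated in the text (20 instructions, 40 cycles latency, 3 FP ops
    dispatched per cycle, one iteration completed every 8 cycles).
    """
    peak_ops_per_iteration = dispatch_width * cycles_per_iteration
    return {
        # Issue slots available while one 40-cycle dependency chain runs.
        "issue_slots_during_one_chain": dispatch_width * chain_latency,
        # How much faster than running one iteration at a time.
        "speedup_over_serial": chain_latency / cycles_per_iteration,
        # Upper bound on ops that fit into the measured 8 cycles.
        "peak_ops_per_iteration": peak_ops_per_iteration,
        # Fraction of that bound actually achieved.
        "utilization": instructions / peak_ops_per_iteration,
    }


if __name__ == "__main__":
    for name, value in throughput_summary().items():
        print(f"{name}: {value}")
```

The speedup of 5 is also the number of iterations that must be in flight at once, which is where the "five different loop iterations" in the next paragraph comes from.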

On Itanium, doing that in a loop would take superhuman effort. Each pass through the loop would have to perform operations from five different loop iterations to cover the latency. That’s tough. You start with one op from iteration 1. Then the first op of iteration 2 and the second of iteration 1. Then 3 ops from 3 iterations, 4 ops from 4 iterations, 5 ops from 5 iterations, and then you iterate. But you also have to handle the case of only 1, 2, 3 or 4 iterations. And if there is a branch in your loop, it is worse. Now consider the case where there is no loop: you’ll have to find other ops to fill the latency. Now consider the case where your function returns: there is nothing you can do on the Itanium, and you’re going to waste a lot of time. That Apple ARM processor will just do it for you. It can gather several hundred operations and execute them when they are ready.
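The ramp-up described above can be made concrete with a minimal sketch. It assumes each iteration splits into five sequential stages (a simplification chosen for illustration) and lists which (iteration, stage) pairs a compiler would have to schedule together at each step. The function name and the stage model are hypothetical, not taken from any Itanium toolchain.

```python
def pipelined_schedule(iterations, stages=5):
    """Return one list per step of the (iteration, stage) pairs issued together.

    Assumption for illustration: iteration i executes stage s at step i + s,
    so consecutive iterations are started one step apart.
    """
    schedule = []
    for step in range(iterations + stages - 1):
        active = [(i, step - i)
                  for i in range(iterations)
                  if 0 <= step - i < stages]
        schedule.append(active)
    return schedule


def widths(iterations, stages=5):
    """Number of overlapped iterations at each step."""
    return [len(group) for group in pipelined_schedule(iterations, stages)]


if __name__ == "__main__":
    print(widths(8))   # ramp up, steady state, ramp down
    print(widths(3))   # short trip count: never reaches five-way overlap
```

With eight iterations the overlap climbs from 1 to 5, holds, then drains. With three iterations it never gets past 3, which is the "1, 2, 3 or 4 iterations only" special case: a statically scheduled machine needs all of these shapes worked out at compile time, whereas an OOO core discovers the overlap at run time.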

Now that was one feature. Itanium had several features of similar complexity. The processor just cannot keep up with a modern processor, no matter how good the compiler is.

What I do wonder is how much of the Itanium design could be used today. For example, OOO execution on the Apple M1 is not perfect. If the available operations are close to the limit set by latency then you have delays, so handling two iterations of a loop could be beneficial. Instruction words of 32 bits might not be optimal. Maybe a 128-bit word would be better: three slots, each holding either one 42-bit instruction or two 21-bit instructions, plus two bits for hints. Maybe memory prefetch instructions would help. All of this only as long as code doesn’t have to be compiled for one specific model of the CPU.
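As a quick sanity check of that bit budget, here is a small sketch that enumerates the slot layouts such a hypothetical 128-bit word would allow. The constants mirror the numbers in the paragraph above; the names are invented for illustration and this is not an existing encoding. (For comparison, the real IA-64 bundle is also 128 bits: three 41-bit instruction slots plus a 5-bit template.)

```python
from itertools import product

WORD_BITS = 128   # proposed instruction word size
SLOTS = 3         # slots per word
SLOT_BITS = 42    # each slot: one 42-bit or two 21-bit instructions
HINT_BITS = 2     # leftover bits for hints


def word_layouts():
    """List every layout as the sequence of instruction widths it contains."""
    layouts = []
    for split in product((1, 2), repeat=SLOTS):
        # split[k] == 1: slot k holds one 42-bit instruction
        # split[k] == 2: slot k holds two 21-bit instructions
        layouts.append([SLOT_BITS // n for n in split for _ in range(n)])
    return layouts


if __name__ == "__main__":
    for layout in word_layouts():
        print(layout, "->", sum(layout) + HINT_BITS, "bits")
```

Every layout fills the word exactly, and a word carries between three and six instructions depending on how many slots are split.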


added 664 characters in body


edited Nov 24, 2023 at 0:10

gnasher729


With knowledge of the state of the art in 2023: The real problem of Itanium is that it can’t compete with OOO processors. I’ll give an example:

I wrote a bit of code taking four FP arguments, and running about 20 instructions with 40 cycles latency on an ARM M1. FP instructions have 4 cycles latency, you can dispatch 3 per cycle, so 120 ops are possible in those 40 cycles latency. All in a loop. The throughput was about one iteration every 8 cycles, 5 times faster, because of OOO processing. So without any particular cleverness in the compiler I got 20 flops in 8 cycles out of a theoretical 24.

On Itanium, in a loop that would take superhuman effort. Each loop would have to perform operations from five different loop iterations to cover the latency. That’s tough. You start with one op from iteration 1. Then the first op of iteration 2 and the second of iteration 1. Then 3 ops from 3 iterations, 4 ops from 4 iterations, five ops from five iterations, and then you iterate. But you also have to handle the case of 1, 2, 3 or 4 iterations only. And if there is a branch in your loop it is worse. But now consider there is no loop. You’ll have to find ops to fill the latency. Now consider your function returns. Nothing you can do now on the Itanium. You’re going to waste a lot of time. That ARM processor will just do it for you. It can gather several hundred operations and do them when they are ready.

Now that was one feature. Itanium had several features of similar complexity. The processor just cannot keep up with a modern processor, no matter how good the compiler is.

What I do wonder is how much of the Itanium design could be used today. For example OOO execution on the ARM M1 is not perfect. If the available operations are close to the limit set by latency then you have delays, so handling two iterations of a loop could be beneficial. Instruction words of 32 bits might not be optimal. Maybe 128 bits = 3 times either a 42 or two 21 bit instruction would be better, plus two bits for hints. Maybe memory prefetch instructions would help. As long as code doesn’t have to be compiled for one specific model of the cpu.



answered Nov 21, 2023 at 21:34

gnasher729


With knowledge of the state of the art in 2023: The real problem of Itanium is that it can’t compete with OOO processors. I’ll give an example:

I wrote a bit of code taking four FP arguments, and running about 20 instructions with 40 cycles latency on an ARM M1. FP instructions have 4 cycles latency, you can dispatch 3 per cycle, so 120 ops are possible in those 40 cycles latency. All in a loop. The throughput was about one iteration every 8 cycles, 5 times faster, because of OOO processing.

On Itanium, in a loop that would take superhuman effort. Each loop would have to perform operations from five different loop iterations to cover the latency. That’s tough. You start with one op from iteration 1. Then the first op of iteration 2 and the second of iteration 1. Then 3 ops from 3 iterations, 4 ops from 4 iterations, five ops from five iterations, and then you iterate. But you also have to handle the case of 1, 2, 3 or 4 iterations only. And if there is a branch in your loop it is worse. But now consider there is no loop. You’ll have to find ops to fill the latency. Now consider your function returns. Nothing you can do now on the Itanium. You’re going to waste a lot of time. That ARM processor will just do it for you. It can gather several hundred operations and do them when they are ready.

Now that was one feature. Itanium had several features of similar complexity. The processor just cannot keep up with a modern processor, no matter how good the compiler is.
