Newest 'cpu-architecture' Questions

0 votes

2 answers

38 views

"docker: no matching manifest for linux/amd64 in the manifest list entries"

I'm trying to run software on a big-endian architecture. Following the update at the end of this answer, I tried this: $ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes ...

optical

267

asked 16 hours ago

3 votes

0 answers

47 views

After enabling an interrupt via CSRRW in RISC-V, how many instructions may execute before trap entry?

In RISC-V machine mode, when you issue a csrrw that sets a bit in mie (i.e. enabling an interrupt that is already pending), must the very next instruction immediately branch to the interrupt handler? ...

Ömer GÜZEL

195

asked yesterday

2 votes

0 answers

75 views

Excessive ping-pong latency between two IPC processes on Sapphire Rapids Xeon with plain loads and stores; instruction order makes a big difference

I am running a simple ping-pong between two processes, A and B, with shared memory: shm_A and shm_B are in separate cache lines. They are allocated with separate calls to shm_open, so probably in different pages, ...

Samuel Hapak

7,274

asked Oct 16 at 7:52

-2 votes

0 answers

89 views

What happens to renamed registers during an Interrupt/Exception?

Consider the following assembly code: (1) r1 = r2 / r3 (2) r2 = r1 + r3 (3) r1 = r3 + r5 (4) r4 = r1 - r6 Here (2) must be executed after (1) because (2) depends on the value of r1, which is ...

Alex Gendelbergen

29

asked Oct 5 at 11:57

2 votes

1 answer

60 views

Why are J1 and J2 used with XOR in ARMv6-M BL instruction immediate calculation?

I’m trying to understand how the BL instruction is decoded in the ARMv6-M architecture. The part I don’t get is in the imm32 calculation: the values of I1 and I2 are derived using J1 and J2, but they’...

zenprogrammer

751

asked Sep 30 at 21:46

1 vote

1 answer

82 views

How are MMIO requests routed in CPU microarchitecture — cache-bypass on same path or a separate bus/port?

Short background: MMIO regions are typically mapped as uncacheable / device memory, so the CPU must not treat device registers like normal cacheable DRAM. I’m asking about the microarchitecture routing and ...

SungwookKang

13

asked Sep 30 at 10:10

0 votes

0 answers

48 views

How to decide the data size handled by each processor/core in SIMD?

I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing. Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently ...

王子儀

1

asked Sep 29 at 16:57

2 votes

2 answers

149 views

Fractional-cycle latency of CPU instructions

I am trying to characterize the instruction latency of ARM's aese and aesmc instructions in Apple's M1, M3 and M4 CPUs. For M1, Dougall Johnson obtains 3 cycles for a fused pair of aese + aesmc. ...

swineone

3,000

asked Sep 18 at 16:13

1 vote

1 answer

90 views

Is CPU multithreading affected by divergence?

Building on this question: the term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...

bigcodeszzer

952

asked Sep 18 at 1:37

7 votes

1 answer

209 views

Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled). My test loop is ...

Andrey Dmitriev

179

asked Sep 12 at 9:26

1 vote

1 answer

98 views

Which resources of a modern x86 CPU core are occupied by memory transactions in flight?

I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory ...

xealits

4,808

asked Sep 5 at 0:27

3 votes

2 answers

147 views

How to support Carryless Multiplication operation in .NET 8.0 on various platforms

I use the Pclmulqdq.CarrylessMultiply method in a .NET 8.0 / C# program. The method performs carryless multiplication using an x86 processor instruction, which is very fast. Method documentation: https://learn....

PanJanek

6,717

asked Sep 1 at 13:55

0 votes

0 answers

87 views

How does a failed spinlock CAS affect out-of-order speculation and RMW reordering on weak memory architectures?

I’m trying to understand how speculative execution interacts with weak memory models (ARM/Power) in the context of a spinlock implemented with a plain CAS. Example: // Spinlock acquisition attempt if (...

Delark

1,385

asked Aug 28 at 15:52

0 votes

0 answers

37 views

What protocol does the LLC directory use to synchronize parallel RFO signals?

The MESI or MOESI protocols need the LLC directory in order to work... and the directory needs to synchronize parallel RFO + snoop-invalidation calls in order for it to work (in TSO architectures that ...

Delark

1,385

asked Aug 27 at 0:47

3 votes

0 answers

117 views

IPC collapse with larger loop bodies despite constant I-cache miss rate, what's the bottleneck?

I'm seeing a dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing loop body size in simple arithmetic code with no branches, but the instruction cache miss rate stays exactly constant ...

Transcendental

969

asked Aug 26 at 1:47
