4,316 questions
0 votes, 2 answers, 38 views
"docker: no matching manifest for linux/amd64 in the manifest list entries"
I'm trying to run software on a big-endian architecture. Following the update at the end of this answer, I tried this:
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
...
3 votes, 0 answers, 47 views
After enabling an interrupt via CSRRW in RISC-V, how many instructions may execute before trap entry?
In RISC-V machine mode, when you issue a csrrw that sets a bit in mie (i.e. enabling an interrupt that is already pending), must the very next instruction immediately branch to the interrupt handler? ...
2 votes, 0 answers, 75 views
Too big a latency of ping-pong between two IPC processes on Sapphire Rapids Xeon with plain loads and stores, instruction order makes a big difference
I am running a simple ping-pong between two processes, A and B, with shared memory:
shm_A and shm_B are in separate cache lines. Allocated with separate calls to shm_open, so probably in different pages, ...
-2 votes, 0 answers, 89 views
What happens to renamed registers during an Interrupt/Exception?
Consider the following assembly code:
(1) r1 = r2 / r3
(2) r2 = r1 + r3
(3) r1 = r3 + r5
(4) r4 = r1 - r6
Here (2) must be executed after (1) because (2) depends on the value of r1, which is ...
2 votes, 1 answer, 60 views
Why are J1 and J2 used with XOR in ARMv6-M BL instruction immediate calculation?
I’m trying to understand how the BL instruction is decoded in the ARMv6-M architecture.
The part I don’t get is in the imm32 calculation: the values of I1 and I2 are derived using J1 and J2, but they’...
1 vote, 1 answer, 82 views
How are MMIO requests routed in CPU microarchitecture — cache-bypass on same path or a separate bus/port?
Short background: MMIO regions are typically mapped as uncacheable / device memory, so the CPU must not treat device registers like normal cacheable DRAM. I’m asking about the microarchitecture routing and ...
0 votes, 0 answers, 48 views
How to decide the data size handled by each processor/core in SIMD?
I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing.
Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently ...
2 votes, 2 answers, 149 views
Fractional-cycle latency of CPU instructions
I am trying to characterize the instruction latency of ARM's aese and aesmc instructions in Apple's M1, M3 and M4 CPUs.
For M1, Dougall Johnson obtains [3 cycles][1] for a fused pair of aese + aesmc. ...
1 vote, 1 answer, 90 views
Is CPU multithreading affected by divergence?
Building on this question here
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
7 votes, 1 answer, 209 views
Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?
I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper-Threading enabled).
My test loop is ...
1 vote, 1 answer, 98 views
Which resources of a modern x86 CPU core are occupied by memory transactions in flight?
I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory ...
3 votes, 2 answers, 147 views
How to support Carryless Multiplication operation in .NET 8.0 on various platforms
I use the Pclmulqdq.CarrylessMultiply method in a .NET 8.0 / C# program. The method performs carryless multiplication using an x86 processor instruction, which is very fast.
Method documentation: https://learn....
0 votes, 0 answers, 87 views
How does a failed spinlock CAS affect out-of-order speculation and RMW reordering on weak memory architectures?
I’m trying to understand how speculative execution interacts with weak memory models (ARM/Power) in the context of a spinlock implemented with a plain CAS. Example:
// Spinlock acquisition attempt
if (...
0 votes, 0 answers, 37 views
What protocol does the LLC directory use to synchronize parallel RFO signals?
The MESI or MOESI protocols need the LLC directory in order to work... and the directory needs to synchronize parallel RFO + snoop-invalidation calls in order for it to work
(in TSO architectures that ...
3 votes, 0 answers, 117 views
IPC collapse with larger loop bodies despite constant I-cache miss rate, what's the bottleneck?
I'm seeing a dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing loop body size in simple arithmetic code with no branches, but the instruction cache miss rate stays exactly constant ...