0 votes
2 answers
38 views

"docker: no matching manifest for linux/amd64 in the manifest list entries"

I'm trying to run software on a big-endian architecture. Following the update at the end of this answer, I tried this: $ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes ...
optical • 267
3 votes
0 answers
47 views

After enabling an interrupt via CSRRW in RISC-V, how many instructions may execute before trap entry?

In RISC-V machine mode, when you issue a csrrw that sets a bit in mie (i.e. enabling an interrupt that is already pending), must the very next instruction immediately branch to the interrupt handler? ...
Ömer GÜZEL
2 votes
0 answers
75 views

Too big a latency of ping-pong between two IPC processes on Sapphire Rapids Xeon with plain loads and stores; instruction order makes a big difference

I am running a simple ping-pong between two processes, A and B, with shared memory: shm_A and shm_B are in separate cache lines. They are allocated with separate calls to shm_open, so probably in different pages, ...
Samuel Hapak • 7,274
-2 votes
0 answers
89 views

What happens to renamed registers during an Interrupt/Exception?

Consider the following assembly code: (1) r1 = r2 / r3 (2) r2 = r1 + r3 (3) r1 = r3 + r5 (4) r4 = r1 - r6 Here (2) must be executed after (1) because (2) depends on the value of r1, which is ...
Alex Gendelbergen
2 votes
1 answer
60 views

Why are J1 and J2 used with XOR in ARMv6-M BL instruction immediate calculation?

I’m trying to understand how the BL instruction is decoded in the ARMv6-M architecture. The part I don’t get is in the imm32 calculation: the values of I1 and I2 are derived using J1 and J2, but they’...
zenprogrammer
1 vote
1 answer
82 views

How are MMIO requests routed in CPU microarchitecture — cache-bypass on same path or a separate bus/port?

Short background: MMIO regions are typically mapped as uncacheable / device memory, so the CPU must not treat device registers like normal cacheable DRAM. I’m asking about the microarchitecture routing and ...
SungwookKang
0 votes
0 answers
48 views

How to decide the data size handled by each processor/core in SIMD?

I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing. Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently ...
王子儀
2 votes
2 answers
149 views

Fractional-cycle latency of CPU instructions

I am trying to characterize the instruction latency of ARM's aese and aesmc instructions in Apple's M1, M3 and M4 CPUs. For M1, Dougall Johnson obtains [3 cycles][1] for a fused pair of aese + aesmc. ...
swineone • 3,000
1 vote
1 answer
90 views

Is CPU multithreading affected by divergence?

Building on this question here: the term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
bigcodeszzer
7 votes
1 answer
209 views

Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled). My test loop is ...
Andrey Dmitriev
1 vote
1 answer
98 views

Which resources of a modern x86 CPU core are occupied by memory transactions in flight?

I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory ...
xealits • 4,808
3 votes
2 answers
147 views

How to support Carryless Multiplication operation in .NET 8.0 on various platforms

I use the Pclmulqdq.CarrylessMultiply method in a .NET 8.0 / C# program. The method performs carryless multiplication using an x86 processor instruction, which is very fast. Method documentation: https://learn....
PanJanek • 6,717
0 votes
0 answers
87 views

How does a failed spinlock CAS affect out-of-order speculation and RMW reordering on weak memory architectures?

I’m trying to understand how speculative execution interacts with weak memory models (ARM/Power) in the context of a spinlock implemented with a plain CAS. Example: // Spinlock acquisition attempt if (...
Delark • 1,385
0 votes
0 answers
37 views

What protocol does the LLC directory use to synchronize parallel RFO signals?

The MESI or MOESI protocols need the LLC directory in order to work... and the directory needs to synchronize parallel RFO + snoop-invalidation calls in order for it to work (in TSO architectures that ...
Delark • 1,385
3 votes
0 answers
117 views

IPC collapse with larger loop bodies despite constant I-cache miss rate, what's the bottleneck?

I'm seeing a dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing loop body size in simple arithmetic code with no branches, but the instruction cache miss rate stays exactly constant ...
Transcendental
