License: CC BY 4.0
arXiv:2605.23603v1 [cs.LG] 22 May 2026

Preisach Attention: A Hysteretic Model of Sequential Memory

Piotr Frydrych
The Metrology and Biomedical Engineering Institute
Faculty of Mechatronics, Warsaw University of Technology
piotr.frydrych@pw.edu.pl
Abstract

We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator γ^αβ\hat{\gamma}_{\alpha\beta} parameterised by learned activation and deactivation thresholds (α,β)(\alpha,\beta), maintaining a stack of local extrema as its internal state. A single-layer PAL-Transformer with O(1)O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton — in contrast to the O(logn)O(\log n) depth required by standard hard-attention transformers (Pérez et al., 2021). Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1)O(1) layers that require O(logn)O(\log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence — PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(nlogn)O(n\log n) total inference cost versus O(n2)O(n^{2}) for standard attention.

Keywords: Preisach operator, hysteresis, attention mechanism, Turing completeness, expressiveness, sequence modelling, long-range dependence, rate-independence.

1  Introduction

The transformer architecture (Vaswani et al., 2017) and its attention mechanism have become the dominant paradigm for sequence modelling. Theoretical analysis of transformer expressiveness has established that hard-attention transformers are Turing-complete (Pérez et al., 2021), that softmax attention corresponds to specific fragments of first-order logic (Hahn, 2020; Barceló et al., 2020), and that practical transformers implement approximate versions of classical algorithms (Akyürek et al., 2023; von Oswald et al., 2023).

A parallel body of work has studied alternative sequence models — recurrent networks (Siegelmann and Sontag, 1995), state space models (Gu and Dao, 2023), and mixture-of-experts architectures (Shazeer et al., 2017) — seeking more efficient representations of long-range dependence. These efforts share a common limitation: their memory mechanisms are either temporal (decaying by distance) or positional (indexed by token location), with no mechanism for value-based memory that persists based on the significance of past inputs rather than their recency.

This paper.

We propose a fundamentally different memory mechanism inspired by the Preisach hysteresis operator (Preisach, 1935), a classical model from mathematical physics used to describe ferromagnetic materials, elastoplastic systems, and — more recently — agent-based financial markets (Frydrych and Szewczyk, 2014; Frydrych, 2019). The Preisach operator aggregates the outputs of binary relays γ^αβ\hat{\gamma}_{\alpha\beta}, each characterised by an activation threshold α\alpha and a deactivation threshold β\beta, weighted by a learned measure μ(α,β)\mu(\alpha,\beta) over the threshold plane.

The key structural properties of the Preisach operator that motivate its use in sequence modelling are:

  1. 1.

    Rate-independence: the output depends only on the sequence of local extrema of the input, not on temporal spacing or absolute position.

  2. 2.

    Wiping property: a new extremum erases all previous extrema of smaller magnitude, providing a natural forgetting mechanism based on significance rather than recency.

  3. 3.

    Extremum stack sufficiency: the stack of alternating local maxima and minima is a minimal sufficient statistic of input history for all rate-independent functionals.

  4. 4.

    Universal approximation: the class of Preisach operators with μL2(𝒯)\mu\in L^{2}(\mathcal{T}) is dense in the space of all continuous, causal, rate-independent functionals (Mayergoyz, 1991).

Main contributions.
  1. 1.

    We define the Preisach Attention Layer (PAL) and its multi-head variant (MPAL), giving exact computational definitions and connecting them to the classical Preisach operator (section˜3).

  2. 2.

    We prove that a single-layer PAL-Transformer is Turing-complete (section˜4), establishing Turing completeness at depth O(1)O(1) — lower than the O(logn)O(\log n) depth of Pérez et al. (2021).

  3. 3.

    We prove a formal expressiveness separation between PAL and transformer attention (section˜5), identifying rate-independence as the separating property.

  4. 4.

    We characterise the function class of PAL through its connection to rate-independent operators and provide a logical characterisation analogous to Barceló et al. (2020) (section˜6).

  5. 5.

    We outline computational complexity advantages of PAL and identify task classes where PAL is predicted to outperform standard attention (section˜7).

Section˜2 reviews background; section˜3 defines PAL; section˜46 contain the main proofs; section˜7 analyses computational complexity; sections˜8 and 10 discuss related work and conclude.

2  Background

2.1  The Preisach Hysteresis Operator

The Preisach operator (Preisach, 1935) was introduced to model magnetic hysteresis in ferromagnetic materials. Let 𝒯={(α,β)2:αβ}\mathcal{T}=\{(\alpha,\beta)\in\mathbb{R}^{2}:\alpha\geq\beta\} be the Preisach half-plane.

Definition 2.1 (Elementary relay).

For (α,β)𝒯(\alpha,\beta)\in\mathcal{T} and input sequence u=(u0,u1,)u=(u_{0},u_{1},\ldots)\in\mathbb{R}^{\mathbb{N}}, the elementary relay γ^αβ[u]:{0,1}\hat{\gamma}_{\alpha\beta}[u]:\mathbb{N}\to\{0,1\} is defined by:

γ^αβ[u](n)={1if unα,0if unβ,γ^αβ[u](n1)otherwise.\hat{\gamma}_{\alpha\beta}[u](n)=\begin{cases}1&\text{if }u_{n}\geq\alpha,\\ 0&\text{if }u_{n}\leq\beta,\\ \hat{\gamma}_{\alpha\beta}[u](n-1)&\text{otherwise.}\end{cases} (1)

The interval (β,α)(\beta,\alpha) is the dead band of the relay.

Definition 2.2 (Preisach operator).

Let μL1(𝒯)\mu\in L^{1}(\mathcal{T}) be a signed measure on 𝒯\mathcal{T}. The Preisach operator 𝒫μ:\mathcal{P}_{\mu}:\mathbb{R}^{\mathbb{N}}\to\mathbb{R}^{\mathbb{N}} is:

𝒫μ[u](n)=𝒯γ^αβ[u](n)μ(α,β)𝑑α𝑑β.\mathcal{P}_{\mu}[u](n)=\iint_{\mathcal{T}}\hat{\gamma}_{\alpha\beta}[u](n)\;\mu(\alpha,\beta)\,d\alpha\,d\beta. (2)

The Preisach operator has three properties central to this paper.

Proposition 2.3 (Rate-independence (Brokate and Sprekels, 1996)).

For any non-decreasing bijection ϕ:[0,T][0,T]\phi:[0,T]\to[0,T], 𝒫μ[uϕ]=𝒫μ[u]ϕ\mathcal{P}_{\mu}[u\circ\phi]=\mathcal{P}_{\mu}[u]\circ\phi.

Proposition 2.4 (Wiping property (Mayergoyz, 1991)).

Let Mi>mi>Mi+1>mi+1>M_{i}>m_{i}>M_{i+1}>m_{i+1}>\ldots be the alternating local maxima and minima of uu. If Mi+1>MiM_{i+1}>M_{i}, then the relay state induced by MiM_{i} is erased — the output 𝒫μ[u]\mathcal{P}_{\mu}[u] after Mi+1M_{i+1} is identical to what it would be had MiM_{i} not occurred.

Proposition 2.5 (Universal approximation (Mayergoyz, 1991)).

The set {𝒫μ:μL2(𝒯)}\{\mathcal{P}_{\mu}:\mu\in L^{2}(\mathcal{T})\} is dense in the space of all continuous, causal, rate-independent functionals on C([0,T])C([0,T]) under the uniform topology.

2.2  The Extremum Stack

A fundamental algorithmic consequence of the wiping property is that the Preisach output depends only on the current extremum stack:

Definition 2.6 (Extremum stack).

The extremum stack of u0:nu_{0:n} is the sequence:

Πn=[(M1,m1),(M2,m2),,(Mk,mk)]\Pi_{n}=[(M_{1},m_{1}),(M_{2},m_{2}),\ldots,(M_{k},m_{k})] (3)

of alternating local maxima M1>M2>>MkM_{1}>M_{2}>\ldots>M_{k} and local minima m1>m2>>mkm_{1}>m_{2}>\ldots>m_{k}, stored in decreasing order, where the pair (Mi,mi)(M_{i},m_{i}) records the ii-th local maximum and its subsequent local minimum.

Proposition 2.7 (Stack sufficiency).

𝒫μ[u](n)\mathcal{P}_{\mu}[u](n) is a measurable function of Πn\Pi_{n} alone. In particular, two sequences uu and vv with identical extremum stacks Πnu=Πnv\Pi_{n}^{u}=\Pi_{n}^{v} satisfy 𝒫μ[u](n)=𝒫μ[v](n)\mathcal{P}_{\mu}[u](n)=\mathcal{P}_{\mu}[v](n) for all μ\mu.

Algorithm 1 Extremum Stack Update
1:Current stack Πn\Pi_{n}, new observation un+1u_{n+1}, previous observation unu_{n}
2:Updated stack Πn+1\Pi_{n+1}
3:if un+1>unu_{n+1}>u_{n} then \triangleright Input rises: update maxima
4:  while Πn\Pi_{n}\neq\emptyset and Mtop<un+1M_{\mathrm{top}}<u_{n+1} do
5:   mlastmtopm_{\mathrm{last}}\leftarrow m_{\mathrm{top}} \triangleright Save last minimum before popping
6:   pop(Πn)\mathrm{pop}(\Pi_{n}) \triangleright Wiping
7:  end while
8:  push(Πn,(un+1,mlast))\mathrm{push}(\Pi_{n},(u_{n+1},m_{\mathrm{last}})) \triangleright New maximum with last surviving minimum
9:else if un+1<unu_{n+1}<u_{n} then \triangleright Input falls: update minima
10:  while Πn\Pi_{n}\neq\emptyset and mtop>un+1m_{\mathrm{top}}>u_{n+1} do
11:   MlastMtopM_{\mathrm{last}}\leftarrow M_{\mathrm{top}} \triangleright Save last maximum before popping
12:   pop(Πn)\mathrm{pop}(\Pi_{n}) \triangleright Wiping
13:  end while
14:  push(Πn,(Mlast,un+1))\mathrm{push}(\Pi_{n},(M_{\mathrm{last}},u_{n+1})) \triangleright New minimum with last surviving maximum
15:end if
16:return Πn+1Πn\Pi_{n+1}\leftarrow\Pi_{n}
Proposition 2.8 (Stack update complexity).

Algorithm˜1 runs in amortised O(1)O(1) time per step — each element is pushed and popped at most once — and requires O(k)O(k) space where knk\leq n is the current stack depth. The relay state γ^αβ[u](n)\hat{\gamma}_{\alpha\beta}[u](n) can be read from Πn\Pi_{n} in O(logk)O(\log k) time by binary search.

2.3  Transformer Attention

For completeness we recall standard definitions. Given query qdq\in\mathbb{R}^{d}, keys Kn×dK\in\mathbb{R}^{n\times d} and values Vn×dV\in\mathbb{R}^{n\times d}, scaled dot-product attention is:

Attn(q,K,V)=softmax(qKd)V.\operatorname{Attn}(q,K,V)=\mathrm{softmax}\!\left(\frac{qK^{\top}}{\sqrt{d}}\right)V. (4)

Hard attention replaces softmax with argmax: aj=𝟏[j=argmaxiqki]a_{j}=\mathbf{1}[j=\operatorname*{arg\,max}_{i}q\cdot k_{i}].

We use the transformer model of Pérez et al. (2021): an LL-layer encoder-decoder with hard attention, sinusoidal position encoding, layer normalisation, and residual connections.

3  Preisach Attention Layer

3.1  Definition

Definition 3.1 (Preisach Attention Layer).

Let u0:nn+1u_{0:n}\in\mathbb{R}^{n+1} be a scalar input sequence and μ={μij}ijL×L\mu=\{\mu_{ij}\}_{i\geq j}\in\mathbb{R}^{L\times L} a discretised measure on the grid 𝒢L={(αi,βj):ij,i,j{1,,L}}\mathcal{G}_{L}=\{(\alpha_{i},\beta_{j}):i\geq j,\,i,j\in\{1,\ldots,L\}\} with αi=iΔ\alpha_{i}=i\Delta, βj=jΔ\beta_{j}=j\Delta. The Preisach Attention Layer is:

PALμ(u0:n)=ijμijγ^αiβj[u](n).\operatorname{PAL}_{\mu}(u_{0:n})=\sum_{i\geq j}\mu_{ij}\cdot\hat{\gamma}_{\alpha_{i}\beta_{j}}[u](n)\in\mathbb{R}. (5)
Definition 3.2 (Multi-head PAL).

Let x0:n(n+1)×dx_{0:n}\in\mathbb{R}^{(n+1)\times d} be a vector-valued sequence. Multi-head PAL with HH heads is:

MPAL(x0:n)=h=1HWO(h)PALμ(h)(WI(h)x0:n)d,\operatorname{MPAL}(x_{0:n})=\sum_{h=1}^{H}W_{O}^{(h)}\cdot\operatorname{PAL}_{\mu^{(h)}}\!\left(W_{I}^{(h)}x_{0:n}\right)\in\mathbb{R}^{d}, (6)

where WI(h)1×dW_{I}^{(h)}\in\mathbb{R}^{1\times d} projects to a scalar signal and WO(h)dW_{O}^{(h)}\in\mathbb{R}^{d} projects back to the model dimension. The scalar projection is a simplifying assumption of the present work; Section 10 discusses the vector extension (vPAL) where WI(h)2×dW_{I}^{(h)}\in\mathbb{R}^{2\times d} and the measure μ\mu is defined over the three-dimensional space (α,β,θ)(\alpha,\beta,\theta) following Frydrych (2019).

Definition 3.3 (PAL-Transformer).

An LL-layer PAL-Transformer is defined by:

zn(0)\displaystyle z_{n}^{(0)} =xn+PE(n),\displaystyle=x_{n}+\mathrm{PE}(n), (7)
zn(l)\displaystyle z_{n}^{(l)} =LN(zn(l1)+MPAL(l)(z0:n(l1))),\displaystyle=\mathrm{LN}\!\left(z_{n}^{(l-1)}+\operatorname{MPAL}^{(l)}(z_{0:n}^{(l-1)})\right), (8)
zn(l)\displaystyle z_{n}^{(l)} =LN(zn(l)+MLP(l)(zn(l))),\displaystyle=\mathrm{LN}\!\left(z_{n}^{(l)}+\mathrm{MLP}^{(l)}(z_{n}^{(l)})\right), (9)

where PE\mathrm{PE} is sinusoidal position encoding, LN\mathrm{LN} is layer normalisation, and MLP(l)\mathrm{MLP}^{(l)} is a two-layer network with ReLU.

3.2  Connection to Classical Preisach Operator

Proposition 3.4 (PAL as discretised Preisach).

PALμ(u0:n)\operatorname{PAL}_{\mu}(u_{0:n}) is the discretised Preisach operator 𝒫μΔ[u](n)\mathcal{P}_{\mu_{\Delta}}[u](n) where μΔ=ijμijδ(αi,βj)\mu_{\Delta}=\sum_{i\geq j}\mu_{ij}\delta_{(\alpha_{i},\beta_{j})} is an atomic measure concentrated on the grid 𝒢L\mathcal{G}_{L}. As LL\to\infty and Δ0\Delta\to 0 with LΔ=CL\Delta=C fixed, PALμ𝒫μ\operatorname{PAL}_{\mu}\to\mathcal{P}_{\mu} uniformly on C([0,C])C([0,C]) by proposition˜2.5.

3.3  Relationship to Standard Attention

Proposition 3.5 (Attention as continuous relaxation of PAL).

Standard softmax attention (4) is a continuous relaxation of PAL in which:

  1. 1.

    The binary relay γ^αβ[u](n){0,1}\hat{\gamma}_{\alpha\beta}[u](n)\in\{0,1\} is replaced by the soft weight aj=softmax(qkj/d)(0,1)a_{j}=\mathrm{softmax}(q\cdot k_{j}/\sqrt{d})\in(0,1),

  2. 2.

    The measure μ(α,β)\mu(\alpha,\beta) is replaced by the value matrix VV,

  3. 3.

    Rate-independence is broken: attention depends on absolute position through qkjq\cdot k_{j}, while PAL depends only on the sequence of extrema.

4  Turing Completeness

We prove that a single-layer PAL-Transformer is Turing-complete by showing it can simulate an arbitrary two-stack pushdown automaton (2-PDA), which is known to be equivalent to a Turing machine (Hopcroft and Ullman, 1979).

4.1  Encoding Alphabet Symbols in the Extremum Stack

Definition 4.1 (Cantor-depth encoding).

Let Γ={a1,,ak}\Gamma=\{a_{1},\ldots,a_{k}\} be a stack alphabet, DmaxD_{\max} a depth bound, Δ>0\Delta>0 a resolution parameter, and Cmax=Dmax(k+1)ΔC_{\max}=D_{\max}(k+1)\Delta. The Cantor-depth encoding is:

code(ai,d)=Cmax[d(k+1)+i]Δ,i{1,,k},d{0,,Dmax}.\mathrm{code}^{*}(a_{i},d)=C_{\max}-[d(k+1)+i]\Delta,\quad i\in\{1,\ldots,k\},\;d\in\{0,\ldots,D_{\max}\}. (10)
Lemma 4.2 (Strict monotonicity).

For all i,i{1,,k}i,i^{\prime}\in\{1,\ldots,k\} and d0d\geq 0: code(ai,d+1)<code(ai,d)\mathrm{code}^{*}(a_{i^{\prime}},d+1)<\mathrm{code}^{*}(a_{i},d).

Proof.

code(ai,d)code(ai,d+1)=[(d+1)(k+1)+id(k+1)i]Δ=[(k+1)+ii]Δ[(k+1)k]Δ=Δ>0\mathrm{code}^{*}(a_{i},d)-\mathrm{code}^{*}(a_{i^{\prime}},d+1)=[(d+1)(k+1)+i^{\prime}-d(k+1)-i]\Delta=[(k+1)+i^{\prime}-i]\Delta\geq[(k+1)-k]\Delta=\Delta>0. ∎

Lemma 4.3 (Stack operations via signal generation).

Under encoding (10), the three stack operations are realised by the following signal emissions, without triggering the wiping property erroneously:

PUSH(ai):\displaystyle\mathrm{PUSH}(a_{i})\colon\quad un+1=code(ai,d+1)+ε,un+2=code(ai,d+1)ε,\displaystyle u_{n+1}=\mathrm{code}^{*}(a_{i},d+1)+\varepsilon,\quad u_{n+2}=\mathrm{code}^{*}(a_{i},d+1)-\varepsilon, (11)
POP:\displaystyle\mathrm{POP}\colon\quad un+1=Mtop1+2ε,\displaystyle u_{n+1}=M_{\mathrm{top}-1}+2\varepsilon, (12)
TOP:\displaystyle\mathrm{TOP}\colon\quad read ai=Dec(Mtop),no signal emitted.\displaystyle\text{read }a_{i}=\operatorname{Dec}(M_{\mathrm{top}}),\quad\text{no signal emitted.} (13)
Proof.

PUSH: By lemma˜4.2, code(ai,d+1)<code(aj,d)=Mtop\mathrm{code}^{*}(a_{i},d+1)<\mathrm{code}^{*}(a_{j},d)=M_{\mathrm{top}} for all jj, so un+1=code(ai,d+1)+ε<Mtopu_{n+1}=\mathrm{code}^{*}(a_{i},d+1)+\varepsilon<M_{\mathrm{top}}. The wiping condition (un+1>Mtopu_{n+1}>M_{\mathrm{top}}) is not triggered; a new pair (Md+1,md+1)(M_{d+1},m_{d+1}) is added to the stack top.

POP: Since Mtop1>MtopM_{\mathrm{top}-1}>M_{\mathrm{top}} (strict monotonicity of stack), un+1>Mtopu_{n+1}>M_{\mathrm{top}} triggers wiping, removing (Mtop,mtop)(M_{\mathrm{top}},m_{\mathrm{top}}). The signal un+1u_{n+1} itself satisfies un+1=Mtop1+2ε<Mtop2u_{n+1}=M_{\mathrm{top}-1}+2\varepsilon<M_{\mathrm{top}-2} (if depth 3\geq 3), so only one pair is removed.

TOP: Dec(Mtop)=ai\operatorname{Dec}(M_{\mathrm{top}})=a_{i} where i=(CmaxMtop)/Δmod(k+1)i=\lfloor(C_{\max}-M_{\mathrm{top}})/\Delta\rfloor\mod(k+1), which is a deterministic function of MtopM_{\mathrm{top}} computable by an MLP. ∎

4.2  Main Theorem

Definition 4.4 (Autoregressive PAL-Transformer).

The Turing-completeness result of theorem˜4.6 is stated for the autoregressive (closed-loop) operational regime of the PAL-Transformer, distinct from the parallel encoder of definition˜3.3. In the autoregressive regime:

  1. 1.

    At step nn, the PAL-Transformer receives the current input token xnx_{n} (which may itself be a function of previous outputs) and produces output znz_{n}.

  2. 2.

    A component of znz_{n} — specifically, the signal channels un+1(1:4)u^{(1:4)}_{n+1} — is appended as the next input token xn+1x_{n+1}, creating a closed generative loop.

  3. 3.

    The extremum stacks Πn(1:4)\Pi^{(1:4)}_{n} persist across steps as the sole recurrent state; no other hidden state is maintained.

This regime corresponds to standard autoregressive language model inference. The result is a generation theorem (unbounded computation traces can be produced), not merely a verification theorem (pre-encoded traces classified). The parallel encoder definition (definition˜3.3) is used for training and for the expressiveness results of section˜5, where the full input sequence is available.

Lemma 4.5 (Multi-head PAL decodes the extremum stack).

For any extremum stack Πn\Pi_{n} of depth at most kk over a discretised alphabet of size K=(k+1)|Γ|K=(k+1)|\Gamma|, there exist measures μ(1),,μ(H)\mu^{(1)},\ldots,\mu^{(H)} with H=log2KH=\lceil\log_{2}K\rceil heads such that the map

ϕ:Πn(PALμ(1)(u0:n),,PALμ(H)(u0:n)){0,1}H\phi:\Pi_{n}\;\mapsto\;\bigl(\operatorname{PAL}_{\mu^{(1)}}(u_{0:n}),\ldots,\operatorname{PAL}_{\mu^{(H)}}(u_{0:n})\bigr)\in\{0,1\}^{H} (14)

is injective. In particular, MtopM_{\mathrm{top}} (the top-of-stack element) is a deterministic function of the HH-dimensional MPAL output, computable by a two-layer MLP.

Proof.

Under the Cantor-depth encoding (definition˜4.1), each stack element aia_{i} at depth dd has a unique scalar value code(ai,d)=Cmax[d(k+1)+i]Δ\mathrm{code}^{*}(a_{i},d)=C_{\max}-[d(k+1)+i]\Delta. The total number of distinct values is K=(Dmax+1)(k+1)K=(D_{\max}+1)(k+1), each separated by Δ>0\Delta>0.

Construct H=log2KH=\lceil\log_{2}K\rceil PAL heads with indicator measures: μij(h)=bh(index(i,j))\mu^{(h)}_{ij}=b_{h}(\mathrm{index}(i,j)), where index(i,j)\mathrm{index}(i,j) maps threshold pair (αi,βj)(\alpha_{i},\beta_{j}) to the corresponding Cantor-depth code, and bh()b_{h}(\cdot) extracts the hh-th bit of the binary representation of that code.

Each head hh outputs PALμ(h)=ijμij(h)γ^αiβj[u](n)\operatorname{PAL}_{\mu^{(h)}}=\sum_{ij}\mu^{(h)}_{ij}\hat{\gamma}_{\alpha_{i}\beta_{j}}[u](n), which equals the hh-th bit of the binary encoding of the top-of-stack Cantor code when the measure is concentrated on the relay active at the stack top.

The HH-bit vector uniquely identifies MtopM_{\mathrm{top}} via binary decoding — a linear operation implementable by a single MLP layer. The full stack is decoded by repeating this procedure with a mask that deactivates the top relay after reading it (POP operation via the signal generation of lemma˜4.3). ∎

Theorem 4.6 (Turing completeness of PAL-Transformer).

For any two-stack pushdown automaton M=(Q,Σ,Γ,δ,q0,Z0,F)M=(Q,\Sigma,\Gamma,\delta,q_{0},Z_{0},F), there exists a single-layer PAL-Transformer 𝒯\mathcal{T} with H=4H=4 MPAL heads, one MLP layer, and arithmetic precision O(log(n|Γ|))O(\log(n\cdot|\Gamma|)) bits that simulates MM step-by-step on inputs of length nn.

Proof.

We construct 𝒯\mathcal{T} with four independent signal channels u(1),u(2),u(3),u(4)u^{(1)},u^{(2)},u^{(3)},u^{(4)}, each processed by a dedicated MPAL head.

Channel 1 — Machine state.

un(1)=codeQ(qn)u^{(1)}_{n}=\mathrm{code}_{Q}(q_{n}) encodes the current state qnQq_{n}\in Q as a scalar. Updated directly at each step by the MLP after computing δ\delta.

Channel 2 — Stack 1.

u0:n(2)u^{(2)}_{0:n} encodes the contents of stack 1 via lemma˜4.3. The extremum stack Πn(2)\Pi^{(2)}_{n} is a bijective encoding of the stack-1 contents.

Channel 3 — Stack 2.

u0:n(3)u^{(3)}_{0:n} encodes stack 2 analogously.

Channel 4 — Input tape.

u0:n(4)u^{(4)}_{0:n} encodes the input word; position encoding allows reading symbol xtx_{t} at step tt via an MPAL head with WI(4)W_{I}^{(4)} tuned to isolate position tt (cf. Pérez et al. (2021), Lemma 4).

Simulation step nn+1n\to n+1.

At each step the PAL-Transformer performs:

  1. 1.

    Read: qnDec(head1)q_{n}\leftarrow\operatorname{Dec}(\mathrm{head}_{1}), anhead4a_{n}\leftarrow\mathrm{head}_{4}, sn(1)TOP(Πn(2))s^{(1)}_{n}\leftarrow\mathrm{TOP}(\Pi^{(2)}_{n}), sn(2)TOP(Πn(3))s^{(2)}_{n}\leftarrow\mathrm{TOP}(\Pi^{(3)}_{n}).

  2. 2.

    Transition: (qn+1,ops1,ops2)MLPδ(qn,an,sn(1),sn(2))(q_{n+1},\mathrm{ops}_{1},\mathrm{ops}_{2})\leftarrow\mathrm{MLP}_{\delta}(q_{n},a_{n},s^{(1)}_{n},s^{(2)}_{n}). The MLP implements δ\delta by tabulation (finite domain, bounded width by Cybenko 1989).

  3. 3.

    Write: Emit signals un+1(2)u^{(2)}_{n+1} and un+1(3)u^{(3)}_{n+1} per ops1\mathrm{ops}_{1} and ops2\mathrm{ops}_{2} via lemma˜4.3.

  4. 4.

    Accept: MLPF\mathrm{MLP}_{F} checks qn+1Fq_{n+1}\in F, Π(2)=Z0\Pi^{(2)}=Z_{0}, Π(3)=Z0\Pi^{(3)}=Z_{0}.

Correctness.

By induction: at step nn, stacks Πn(2)\Pi^{(2)}_{n} and Πn(3)\Pi^{(3)}_{n} faithfully encode the 2-PDA configuration (qn,s1:m1(1),s1:m2(2))(q_{n},s^{(1)}_{1:m_{1}},s^{(2)}_{1:m_{2}}). The base case is the initial configuration q0,Z0,Z0q_{0},Z_{0},Z_{0}. The inductive step follows from lemmas˜4.3 and 4.2.

Since 2-PDA \equiv Turing machine (Hopcroft and Ullman, 1979), 𝒯\mathcal{T} is Turing-complete.

Depth.

The construction uses L=1L=1 MPAL layer and L=1L=1 MLP layer — total depth O(1)O(1), independent of nn.

Precision.

Encoding depth dnd\leq n and alphabet size k=|Γ|k=|\Gamma| requires distinguishing values spaced Δ\Delta apart up to Cmax=n(k+1)ΔC_{\max}=n(k+1)\Delta, needing O(log(nk))O(\log(nk)) bits. ∎

Corollary 4.7 (Depth separation from Transformer).

There exists a language LL recognisable by a 1-layer PAL-Transformer that requires Ω(logn)\Omega(\log n) layers for any transformer with O(1)O(1) heads (under the circuit complexity lower bounds of Furst et al. (1984)).

Corollary 4.8 (Vector PAL is TC with a single head).

A single-layer vector PAL-Transformer (vPAL) with projection WI2×dW_{I}\in\mathbb{R}^{2\times d}, a single head (H=1H=1), one MLP layer, and arithmetic precision O(log(n|Γ|))O(\log(n\cdot|\Gamma|)) bits is Turing-complete.

Proof.

The two-dimensional signal 𝐮n=(unx,uny)2\mathbf{u}_{n}=(u_{n}^{x},u_{n}^{y})\in\mathbb{R}^{2} admits two independent extremum stacks:

Πnx=UpdateStack(Πn1x,unx),Πny=UpdateStack(Πn1y,uny).\Pi^{x}_{n}=\mathrm{UpdateStack}(\Pi^{x}_{n-1},\,u_{n}^{x}),\qquad\Pi^{y}_{n}=\mathrm{UpdateStack}(\Pi^{y}_{n-1},\,u_{n}^{y}). (15)

We assign unxu_{n}^{x} to encode stack 1 of the 2-PDA and unyu_{n}^{y} to encode stack 2, both using the Cantor-depth encoding of definition˜4.1. The machine state qnQq_{n}\in Q and input tape are encoded in the angular structure of the vPAL measure μ(αx,βx,αy,βy,θ)\mu(\alpha_{x},\beta_{x},\alpha_{y},\beta_{y},\theta): distinct sectors θ[0,2π)\theta\in[0,2\pi) correspond to distinct states qQq\in Q, following the superposition of Preisach half-planes introduced by Frydrych (2019) for two-axis fluxgate sensors (Chapter 4.2.2 of Frydrych 2019). The MLP reads TOP(Πnx)\mathrm{TOP}(\Pi^{x}_{n}) and TOP(Πny)\mathrm{TOP}(\Pi^{y}_{n}) simultaneously from the single vPAL head and computes the transition δ\delta exactly as in theorem˜4.6. Correctness and depth O(1)O(1) follow identically. Since H=1H=1 head carries both stacks via the two signal dimensions, the head count is reduced from H=4H=4 to H=1H=1 at the cost of replacing the scalar projection WI1×dW_{I}\in\mathbb{R}^{1\times d} with the vector projection WI2×dW_{I}\in\mathbb{R}^{2\times d}. ∎

Remark 4.9 (Differentiability and practical training).

The binary relay γ^αβ\hat{\gamma}_{\alpha\beta} is discontinuous in both the input unu_{n} and the thresholds (α,β)(\alpha,\beta), making exact backpropagation undefined at switching points. Practical implementations require a smooth relaxation, e.g. the stateful sigmoid relaxation sn+1=sn[1στ(βun)]+(1sn)στ(unα)s_{n+1}=s_{n}[1-\sigma_{\tau}(\beta-u_{n})]+(1-s_{n})\sigma_{\tau}(u_{n}-\alpha) with temperature τ0\tau\to 0 during training (straight-through estimator or curriculum annealing). The theoretical results of this paper hold for the exact binary relay; the relaxed version is studied empirically in companion work.

Remark 4.10 (Position encoding and rate-independence).

Definition 3.3 includes sinusoidal position encoding PE(n)\mathrm{PE}(n), which may appear to contradict rate-independence. The tension is deliberate: PE\mathrm{PE} is used by the MLP layers (to implement positional logic, e.g. reading the input tape in the TC proof) but is not passed through the MPAL heads. The MPAL output itself — the integral over active relays — remains rate-independent. Rate-independence therefore applies to the PAL component of the architecture, not to the full PAL-Transformer. This is consistent with theorem˜5.7, which characterises the function class of the PAL heads, and with the expressiveness separation, which constructs functions PAL heads cannot compute.

Remark 4.11 (Heads vs. signal dimensions as exchangeable resources).

corollary˜4.8 formalises the intuition that in PAL, the number of heads HH and the signal dimension dud_{u} are exchangeable: HH scalar heads can be replaced by a single head with HH-dimensional projection. This exchangeability does not hold for standard multi-head attention, where each head learns an independent linear projection without the hysteretic coupling that links dimensions in vPAL. Specifically, HH scalar PAL heads require HL2H\cdot L^{2} parameters for the measure; a single HH-dimensional vPAL head requires L2HL^{2H} parameters (the HH-fold Cartesian product of Preisach half-planes), which grows exponentially. The minimal TC architecture is therefore scalar PAL at H=2H=2 (two stacks, two heads) or vPAL at H=1H=1 (two stacks, one head with du=2d_{u}=2).

5  Expressiveness Separation

We prove the central incomparability result.

Theorem 5.1 (Expressiveness incomparability).

Let PAL\mathcal{F}_{\operatorname{PAL}} and Tr\mathcal{F}_{\mathrm{Tr}} be the function classes computable by bounded-depth PAL-Transformer and Transformer respectively. Then:

PALTrandTrPAL.\mathcal{F}_{\operatorname{PAL}}\setminus\mathcal{F}_{\mathrm{Tr}}\neq\emptyset\qquad\text{and}\qquad\mathcal{F}_{\mathrm{Tr}}\setminus\mathcal{F}_{\operatorname{PAL}}\neq\emptyset. (16)

5.1  Functions PAL computes but Transformer cannot (at bounded depth)

Proposition 5.2 (Historical range in O(1)O(1) layers).

The function frange(u0:n)=maxinuimininuif_{\mathrm{range}}(u_{0:n})=\max_{i\leq n}u_{i}-\min_{i\leq n}u_{i} is computable by a 1-layer PAL in O(1)O(1) time.

Proof.

M1=maxinuiM_{1}=\max_{i\leq n}u_{i} and mk=mininuim_{k}=\min_{i\leq n}u_{i} are directly readable from the extremum stack Πn\Pi_{n} as the first maximum and last minimum. frange=M1mkf_{\mathrm{range}}=M_{1}-m_{k}. ∎

Proposition 5.3 (Transformer requires Ω(logn)\Omega(\log n) layers for range).

Any transformer with O(1)O(1) layers and O(1)O(1) heads cannot compute frangef_{\mathrm{range}} exactly on sequences of length nn.

Proof.

MAX\mathrm{MAX} over nn values is not in AC0\mathrm{AC}^{0} (Håstad, 1987) — it cannot be computed by constant-depth Boolean circuits of polynomial size. A constant-depth transformer with O(1)O(1) heads is equivalent to an AC0\mathrm{AC}^{0} circuit (cf. Hahn (2020)), so it cannot compute MAX\mathrm{MAX} exactly. Since frangef_{\mathrm{range}} requires MAX\mathrm{MAX} as a subcomputation, the claim follows. ∎

5.2  Functions Transformer computes but PAL cannot

Proposition 5.4 (Random-access retrieval).

The function fcopy(x0:n,p)=xpf_{\mathrm{copy}}(x_{0:n},p)=x_{p} (retrieve token at position pp) is computable by a 1-layer hard-attention Transformer in O(1)O(1) layers.

Proof.

Set qn=PE(p)q_{n}=\mathrm{PE}(p) and kj=PE(j)k_{j}=\mathrm{PE}(j). Hard attention selects j=argmaxjqnkj=pj^{*}=\operatorname*{arg\,max}_{j}q_{n}\cdot k_{j}=p. The output is vp=xpv_{p}=x_{p}. ∎

Proposition 5.5 (PAL cannot perform exact random access).

No PAL-Transformer (of any depth or head count) can compute fcopy(x0:n,p)=xpf_{\mathrm{copy}}(x_{0:n},p)=x_{p} for all sequences and positions.

Proof.

We give a structural impossibility proof based on rate-independence (propositions˜2.3 and 5.7).

Construction.

Fix any target position p{1,,n1}p\in\{1,\ldots,n-1\} and any value vvv\neq v^{\prime}. Define two sequences that differ only at position pp:

𝐱\displaystyle\mathbf{x} =(x0,,xp1,v,xp+1,,xn),\displaystyle=(x_{0},\ldots,x_{p-1},\,v,\,x_{p+1},\ldots,x_{n}), (17)
𝐱\displaystyle\mathbf{x}^{\prime} =(x0,,xp1,v,xp+1,,xn),\displaystyle=(x_{0},\ldots,x_{p-1},\,v^{\prime},\,x_{p+1},\ldots,x_{n}), (18)

where the surrounding values xp1x_{p-1} and xp+1x_{p+1} satisfy xp1<min(v,v)<max(v,v)<xp+1x_{p-1}<\min(v,v^{\prime})<\max(v,v^{\prime})<x_{p+1} (i.e. pp is in a strictly monotone ascending segment of both sequences).

Extremum stacks are identical.

Since position pp lies in a monotone ascending segment of both sequences, it is not a local extremum of either 𝐱\mathbf{x} or 𝐱\mathbf{x}^{\prime}. Therefore Πn(𝐱)=Πn(𝐱)\Pi_{n}(\mathbf{x})=\Pi_{n}(\mathbf{x}^{\prime}): the extremum stacks are identical, because only positions jj where uju_{j} is a local extremum enter the stack (algorithm˜1), and those positions are the same in both sequences.

Rate-independence forces identical outputs.

By theorem˜5.7, every PAL-Transformer computes a rate-independent function, hence a function of Πn\Pi_{n} alone (proposition˜2.7). Since Πn(𝐱)=Πn(𝐱)\Pi_{n}(\mathbf{x})=\Pi_{n}(\mathbf{x}^{\prime}), any PAL-Transformer outputs the same value on 𝐱\mathbf{x} and 𝐱\mathbf{x}^{\prime}. But fcopy(𝐱,p)=vv=fcopy(𝐱,p)f_{\mathrm{copy}}(\mathbf{x},p)=v\neq v^{\prime}=f_{\mathrm{copy}}(\mathbf{x}^{\prime},p). Therefore no PAL-Transformer computes fcopyf_{\mathrm{copy}}. ∎

Remark 5.6 (Why the probabilistic argument is incorrect).

An earlier version of this proof argued that position pp is a local extremum with probability Θ(1/n)\Theta(1/n) for a random sequence, making fcopyf_{\mathrm{copy}} inaccessible with high probability. This is incorrect: for i.i.d. sequences from a continuous distribution, an interior position is a local extremum (maximum or minimum) with probability 2/32/3 by the symmetry of the six orderings of (up1,up,up+1)(u_{p-1},u_{p},u_{p+1}). The correct argument is the structural one above, which applies universally and does not rely on any probabilistic assumption about the input.

5.3  The separating property: rate-independence

Theorem 5.7 (Rate-independence as separating property).

A function f:n+1f:\mathbb{R}^{n+1}\to\mathbb{R} is computable by a PAL-Transformer only if it is rate-independent in the sense of proposition˜2.3. Standard attention computes functions that are not rate-independent. Hence rate-independence is a necessary condition for membership in PAL\mathcal{F}_{\operatorname{PAL}} and a sufficient condition for fTrPALf\notin\mathcal{F}_{\mathrm{Tr}}\setminus\mathcal{F}_{\operatorname{PAL}}.

Proof.

PAL is a composition of rate-independent operators (relay + linear combination) and rate-independent nonlinearities (ReLU is rate-independent). By closure under composition, any function computable by a PAL-Transformer is rate-independent.

Standard attention is not rate-independent: the attention weight aj=softmax(qnkj/d)a_{j}=\mathrm{softmax}(q_{n}\cdot k_{j}/\sqrt{d}) depends explicitly on position through kj=WK(xj+PE(j))k_{j}=W_{K}(x_{j}+\mathrm{PE}(j)), so inserting a time-rescaling ϕ\phi changes the output. ∎

6  Logical Characterisation

Barceló et al. (2020) showed that soft-attention transformers correspond to FO\mathrm{FO} with aggregate functions — a fragment of first-order logic over sequences. We give an analogous characterisation for PAL.

Definition 6.1 (Extremum First-Order Logic, EFO\mathrm{EFO}).

EFO\mathrm{EFO} is the extension of first-order logic over sequences with:

  1. 1.

    Quantification over extremal positions: extiϕ(i)\exists^{\mathrm{ext}}i\,\phi(i) asserts that there exists a local extremum position ii satisfying ϕ\phi.

  2. 2.

    An extremum aggregate operator: ExtAggi[f(i)]\mathrm{ExtAgg}_{i}[f(i)] sums f(i)f(i) over all extremal positions weighted by μ\mu.

  3. 3.

    No quantification over arbitrary positions.

Theorem 6.2 (PAL corresponds to EFO).

A function f:Σf:\Sigma^{*}\to\mathbb{R} is computable by a bounded-depth PAL-Transformer if and only if it is definable in EFO\mathrm{EFO}.

Proof sketch..

(\Rightarrow) Each MPAL head computes i:ui is extremumμijγ^αiβj\sum_{i:u_{i}\text{ is extremum}}\mu_{ij}\hat{\gamma}_{\alpha_{i}\beta_{j}} — an extremum aggregate. Composition through MLP layers corresponds to Boolean combinations of EFO\mathrm{EFO} formulae.

(\Leftarrow) Each EFO\mathrm{EFO} formula can be implemented by appropriate choice of μ(h)\mu^{(h)} and WI(h)W_{I}^{(h)} in MPAL. The extremal quantifier is implemented by the relay’s dead-band — only extremal positions cause relay state changes, so only extremal positions contribute to the sum. Full proof by structural induction on EFO\mathrm{EFO} formula complexity; details in appendix˜A. ∎

Remark 6.3 (Comparison with FO + Aggregate).

EFOFO+Aggregate\mathrm{EFO}\subsetneq\mathrm{FO}+\mathrm{Aggregate}: EFO\mathrm{EFO} cannot quantify over arbitrary positions (no random access), while FO+Aggregate\mathrm{FO}+\mathrm{Aggregate} can. Conversely, EFO\mathrm{EFO} can express frangef_{\mathrm{range}} directly as ExtAgg[𝟏[i=globalmax]ui]ExtAgg[𝟏[i=globalmin]ui]\mathrm{ExtAgg}[\mathbf{1}[i=\mathrm{global\ max}]\cdot u_{i}]-\mathrm{ExtAgg}[\mathbf{1}[i=\mathrm{global\ min}]\cdot u_{i}], which has no bounded-depth FO\mathrm{FO} definition.

7  Computational Complexity

Theorem 7.1 (Complexity of PAL inference).

For a single-layer HH-head PAL-Transformer processing a sequence of length nn:

  1. 1.

    Time per token: O(Hlogkd)O(H\cdot\log k\cdot d) where knk\leq n is extremum stack depth, dd model dimension.

  2. 2.

    Total time for sequence of length nn: O(Hnlognd)O(H\cdot n\log n\cdot d).

  3. 3.

    Memory: O(Hkd)O(H\cdot k\cdot d).

Compare with standard attention: time O(n2d)O(n^{2}d), memory O(nd)O(nd).

Table 1: Complexity and expressiveness comparison. nn = sequence length, dd = model dimension, knk\leq n = extremum stack depth, dud_{u} = signal dimension.
Architecture Time (total) Memory Depth for TC Heads for TC
Transformer (softmax) O(n2d)O(n^{2}d) O(nd)O(nd) not TC
Transformer (hard) O(n2d)O(n^{2}d) O(nd)O(nd) O(logn)O(\log n) O(1)O(1)
Mamba / SSM O(nd)O(nd) O(d)O(d) open
RWKV O(nd)O(nd) O(d)O(d) open
PAL (scalar, du=1d_{u}=1) 𝐎(𝐧log𝐧𝐝)\mathbf{O(n\log n\cdot d)} 𝐎(𝐤𝐝)\mathbf{O(kd)} 𝐎(𝟏)\mathbf{O(1)} H=4H=4
vPAL (vector, du=2d_{u}=2) 𝐎(𝐧log𝐧𝐝)\mathbf{O(n\log n\cdot d)} 𝐎(𝐤𝐝)\mathbf{O(kd)} 𝐎(𝟏)\mathbf{O(1)} 𝐇=𝟏\mathbf{H=1}
Predicted task advantages.

By theorems˜5.1 and 5.7, PAL is predicted to outperform standard attention on tasks that are rate-independent and require long episodic memory:

  1. 1.

    Tracking entity states across long documents (who did what to whom).

  2. 2.

    Detecting anomalies in time series (historical range, running extrema).

  3. 3.

    Reasoning over ordered events without positional sensitivity (logical puzzles, symbolic reasoning).

  4. 4.

    Energy market dispatch where decision thresholds rather than price histories determine behaviour (Frydrych and Szewczyk, 2014).

8  Related Work

Expressiveness of transformers.

Pérez et al. (2021) proved Turing completeness of hard-attention transformers at depth O(logn)O(\log n). Hahn (2020) and Barceló et al. (2020) characterised soft-attention transformers as FO\mathrm{FO} with aggregate functions. Merrill and Sabharwal (2023) established circuit complexity bounds separating transformer classes. PAL is a new point in this space, incomparable to existing architectures.

Alternative memory mechanisms.

Mamba (Gu and Dao, 2023) uses selective SSMs with linear recurrence and input-dependent forgetting. RWKV (Peng et al., 2023) replaces attention with time-decay weighted recurrence. Titans (Behrouz et al., 2025) learns long-term memory through online gradient descent. All use temporal (recency-based) forgetting; PAL uses significance-based forgetting through wiping.

Hysteresis in machine learning.

Frydrych and Szewczyk (2014); Frydrych (2019) applied Preisach-type models to financial market modelling. Barroso et al. (2015) used Preisach operators for battery electrochemical modelling. To our knowledge, this is the first work to use the Preisach operator as a sequence modelling layer in neural networks.

Sparse and efficient attention.

Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) implement sparse attention patterns. PAL achieves a different kind of sparsity — extremal sparsity — where only local maxima and minima contribute to the output, as opposed to sliding windows or random patterns.

9  Connection to the Random-Field Ising Model

9.1  The Preisach–RFIM Equivalence

The connection between the Preisach operator and the Random-Field Ising Model (RFIM) has been established in the physics literature (Sethna et al., 2006; Dahmen and Sethna, 1993), and provides a bridge between PAL and a broad class of combinatorial and statistical problems.

The mean-field RFIM at T=0T=0 consists of NN Ising spins σi{1,+1}\sigma_{i}\in\{-1,+1\} with Hamiltonian:

=J2Nijσiσji(hi+H)σi,\mathcal{H}=-\frac{J}{2N}\sum_{i\neq j}\sigma_{i}\sigma_{j}-\sum_{i}(h_{i}+H)\sigma_{i}, (19)

where J>0J>0, HH is a uniform external field, and {hi}\{h_{i}\} are independent random fields drawn from distribution 𝒫(h)\mathcal{P}(h).

At T=0T=0, each spin satisfies the single-spin stability condition: σi=sign(Jm+hi+H)\sigma_{i}=\operatorname{sign}(Jm+h_{i}+H), where m=N1iσim=N^{-1}\sum_{i}\sigma_{i} is the mean magnetisation. Spin σi\sigma_{i} flips from 1-1 to +1+1 when HH crosses the activation threshold αi=hiJm\alpha_{i}=-h_{i}-Jm from below, and flips back when HH crosses the deactivation threshold βi=hi+Jm\beta_{i}=-h_{i}+Jm from above.

Proposition 9.1 (Preisach–RFIM equivalence).

The mean-field RFIM at T=0T=0, driven quasi-statically by external field HH, is equivalent to a Preisach operator 𝒫μ[H]\mathcal{P}_{\mu}[H] with measure:

μ(α,β)=𝒫(α+β2)δ(αβ2Jm),\mu(\alpha,\beta)=\mathcal{P}\!\left(\frac{\alpha+\beta}{2}\right)\cdot\delta\!\left(\alpha-\beta-2Jm\right), (20)

where the measure is supported on the line αβ=2Jm\alpha-\beta=2Jm within the Preisach half-plane 𝒯\mathcal{T}. The magnetisation m(H)=𝒫μ[H]m(H)=\mathcal{P}_{\mu}[H] satisfies the self-consistency equation m=𝒫μ[H](t)m=\mathcal{P}_{\mu}[H](t).

Proof.

Each spin ii contributes a relay γ^αiβi\hat{\gamma}_{\alpha_{i}\beta_{i}} with αiβi=2Jm\alpha_{i}-\beta_{i}=2Jm (the coupling gap). The output σi=2γ^αiβi[H]1\sigma_{i}=2\hat{\gamma}_{\alpha_{i}\beta_{i}}[H]-1 and the aggregate magnetisation:

m=1Niσi=21Niγ^αiβi[H]1=2𝒫μ[H]1,m=\frac{1}{N}\sum_{i}\sigma_{i}=2\cdot\frac{1}{N}\sum_{i}\hat{\gamma}_{\alpha_{i}\beta_{i}}[H]-1=2\mathcal{P}_{\mu}[H]-1,

where μ\mu is the empirical measure of threshold pairs, converging to (20) as NN\to\infty by the law of large numbers applied to i.i.d. hi𝒫h_{i}\sim\mathcal{P}. ∎

Remark 9.2 (Structural consequences).

proposition˜9.1 implies that the hysteresis loop, subloop structure, and wiping property of the RFIM are exact consequences of the Preisach operator structure. In particular, the return-point memory of the RFIM (Sethna et al., 2006) is precisely the wiping property (proposition˜2.4), and the extremum stack (definition˜2.6) is the minimal sufficient statistic of the RFIM’s history under quasi-static driving.

9.2  PAL as a Learned, Sequential RFIM

Under the Preisach–RFIM equivalence, PAL is a learned, sequential, non-equilibrium generalisation of the RFIM:

Dimension RFIM PAL
Spins fixed binary σi{1,+1}\sigma_{i}\in\{-1,+1\} relay states rk{0,1}r_{k}\in\{0,1\}, learned
Driving signal quasi-static field HH arbitrary sequence u0:nu_{0:n}
Disorder quenched {hi}𝒫\{h_{i}\}\sim\mathcal{P} learned measure μ(α,β)\mu(\alpha,\beta)
Coupling mean-field J/NJ/N implicit through self-consistency of μ\mu
Dynamics equilibrium (lowest energy) causal, rate-independent
Memory return-point memory (history of HH) extremum stack Πn\Pi_{n}
Wiping return-point memory property proposition˜2.4

The key conceptual shift: the RFIM asks "what is the equilibrium configuration at field HH?" while PAL asks "what is the causal output of the sequence u0:nu_{0:n}?" This shift from equilibrium to sequential opens PAL to a broad class of problems where Ising-type models apply but the data arrives as a stream.

9.3  Problems where PAL Inherits Ising Expressiveness

We identify three problem classes where the RFIM–Preisach equivalence suggests PAL has structural advantages.

9.3.1  Sequential Binary Optimisation

The Ising model underlies a broad class of NP-hard combinatorial problems: MAX-CUT, graph colouring, satisfiability, and the Hopfield associative memory (Lucas, 2014). Their energy function has the form:

E(𝝈)=i<jJijσiσjihiσi,E(\boldsymbol{\sigma})=-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j}-\sum_{i}h_{i}\sigma_{i}, (21)

and the goal is 𝝈=argminE\boldsymbol{\sigma}^{*}=\operatorname*{arg\,min}E.

Proposition 9.3 (PAL tracks mean-field Ising energy for streaming interactions).

Let interactions J1,J2,,JnJ_{1},J_{2},\ldots,J_{n}\in\mathbb{R} arrive sequentially as a stream, drawn i.i.d. from a distribution 𝒫J\mathcal{P}_{J} with mean J¯\bar{J} and variance σJ2\sigma_{J}^{2}. Define the scalar signal ut=Jtu_{t}=J_{t}. Then the PAL output with measure μ(α,β)=α𝟏[α=β]\mu^{*}(\alpha,\beta)=\alpha\cdot\mathbf{1}[\alpha=-\beta] satisfies:

PALμ(u0:n)=t=0nJtγ^JtJt[u](t)=E¯n+O(n1/2),\operatorname{PAL}_{\mu^{*}}(u_{0:n})\;=\;\sum_{t=0}^{n}J_{t}\cdot\hat{\gamma}_{J_{t}-J_{t}}[u](t)\;=\;\bar{E}_{n}+O(n^{1/2}), (22)

where E¯n=1Ni<jJijmimj\bar{E}_{n}=-\frac{1}{N}\sum_{i<j}J_{ij}m_{i}m_{j} is the mean-field Ising energy for the empirical mean magnetisation m=𝔼tn[σt]m=\mathbb{E}_{t\leq n}[\sigma_{t}], and the O(n1/2)O(n^{1/2}) error follows from a standard concentration bound on the empirical mean.

Proof.

Each interaction JtJ_{t} contributes to the Ising energy when its relay γ^JtJt[u](t)=1\hat{\gamma}_{J_{t}-J_{t}}[u](t)=1, i.e. when JtJtJ_{t}\geq J_{t} — which holds trivially at the moment of arrival. The relay then tracks whether JtJ_{t} remains the dominant interaction. By the wiping property, relay γ^JtJt\hat{\gamma}_{J_{t}-J_{t}} remains active at time nn iff no subsequent Jt>JtJ_{t^{\prime}}>J_{t} has arrived. The PAL aggregate sums JtJ_{t} over all non-dominated interactions, giving a running estimate of the effective coupling.

For the mean-field case Jij=J/NJ_{ij}=J/N for all pairs, the self-consistency equation m=𝒫(h) 1[h>JmH]𝑑hm=\int\mathcal{P}(h)\,\mathbf{1}[h>-Jm-H]\,dh from proposition˜9.1 holds exactly: PALμ(u0:n)=Jm2/2+O(n1/2)\operatorname{PAL}_{\mu^{*}}(u_{0:n})=Jm^{2}/2+O(n^{-1/2}) by the law of large numbers applied to mm over nn observations, recovering the mean-field energy Jm2/2-Jm^{2}/2. The error bound O(n1/2)O(n^{1/2}) follows from Hoeffding’s inequality applied to the sum of bounded relay outputs. ∎

Remark 9.4 (Relation to MAX-CUT).

The mean-field Ising energy in proposition˜9.3 is related to the MAX-CUT value by CUT(S)=(EmaxE(𝛔))/2\mathrm{CUT}(S)=(E_{\mathrm{max}}-E(\boldsymbol{\sigma}))/2 for the optimal cut SS. However, PAL does not solve MAX-CUT — it tracks the running mean-field energy, which provides a lower bound on the cut value. The advantage is computational: PAL updates this estimate in O(logk)O(\log k) per new interaction, without re-solving from scratch.

9.3.2  Associative Memory with Structured Forgetting

The Hopfield network (Hopfield, 1982) stores PP patterns {𝝃μ}μ=1P\{\boldsymbol{\xi}^{\mu}\}_{\mu=1}^{P} as attractors of Ising dynamics with Jij=N1μξiμξjμJ_{ij}=N^{-1}\sum_{\mu}\xi_{i}^{\mu}\xi_{j}^{\mu}. Retrieval capacity is limited to P<0.14NP<0.14N patterns before catastrophic interference (Amit et al., 1985).

PAL offers a different memory model where forgetting is determined by significance (extremality) rather than capacity:

Proposition 9.5 (PAL as hysteretic associative memory with capacity bound).

Let patterns 𝛏1,,𝛏Pd\boldsymbol{\xi}^{1},\ldots,\boldsymbol{\xi}^{P}\in\mathbb{R}^{d} arrive sequentially with activation strengths ut=𝛏t2u_{t}=\|\boldsymbol{\xi}^{t}\|_{2}. Let kk be the depth of the extremum stack Πn\Pi_{n} after all PP patterns have been processed.

  1. 1.

    Storage: The extremum stack Πn\Pi_{n} retains exactly those kk patterns whose activation strength constitutes a local extremum of the stream u1,,uPu_{1},\ldots,u_{P}. All other patterns are wiped by the wiping property (proposition˜2.4).

  2. 2.

    Non-overlapping threshold support: The kk retained patterns correspond to disjoint threshold pairs (Ms,ms)(M_{s},m_{s}) with MsMsM_{s}\neq M_{s^{\prime}} for sss\neq s^{\prime} (strict monotonicity of the stack, lemma˜4.2), so their relays have disjoint support on 𝒯\mathcal{T} and their contributions to PALμ\operatorname{PAL}_{\mu} are orthogonal in measure space. Note that wiping is a form of interference between patterns: a stronger subsequent pattern erases a prior one. The claim is that the kk surviving patterns do not mutually interfere in the PAL output.

  3. 3.

    Capacity: The maximum number of simultaneously retrievable patterns is exactly kk, bounded by the number of local extrema in u1:Pu_{1:P}:

    Capacity(PAL)=knext(u1:P),\mathrm{Capacity}(\mathrm{PAL})=k\;\leq\;n_{\mathrm{ext}}(u_{1:P}), (23)

    where next(u1:P)n_{\mathrm{ext}}(u_{1:P}) is the number of local extrema in the activation strength sequence. This differs fundamentally from Hopfield capacity 0.14N\approx 0.14N (Amit et al., 1985): PAL capacity is limited by stack depth, not weight-matrix rank.

Proof.

Storage (part 1): Direct from proposition˜2.4. Pattern 𝝃t\boldsymbol{\xi}^{t} is wiped when a subsequent pattern 𝝃t\boldsymbol{\xi}^{t^{\prime}} with 𝝃t2>𝝃t2\|\boldsymbol{\xi}^{t^{\prime}}\|_{2}>\|\boldsymbol{\xi}^{t}\|_{2} arrives, since the new maximum overwrites the old in the extremum stack. Only patterns at local extrema of u1:Pu_{1:P} survive.

Zero interference (part 2): The PAL output is PALμ(u0:n)=ijμijγ^αiβj[u](n)\operatorname{PAL}_{\mu}(u_{0:n})=\sum_{i\geq j}\mu_{ij}\hat{\gamma}_{\alpha_{i}\beta_{j}}[u](n). Each retained pattern 𝝃ts\boldsymbol{\xi}^{t_{s}} corresponds to a unique pair (Ms,ms)(M_{s},m_{s}) in Πn\Pi_{n} with MsMsM_{s}\neq M_{s^{\prime}} for sss\neq s^{\prime} (strict monotonicity of the stack, lemma˜4.2). The relays γ^Msms\hat{\gamma}_{M_{s}m_{s}} and γ^Msms\hat{\gamma}_{M_{s^{\prime}}m_{s^{\prime}}} have disjoint support on the Preisach half-plane 𝒯\mathcal{T} when MsMsM_{s}\neq M_{s^{\prime}}, so their contributions to PALμ\operatorname{PAL}_{\mu} are orthogonal — no cross-pattern interference.

Capacity (part 3): The stack depth knext(u1:P)k\leq n_{\mathrm{ext}}(u_{1:P}) is bounded by the number of local extrema of the activation sequence, since each extremum corresponds to at most one stack entry by the push/pop mechanics of algorithm˜1. Each entry encodes one pattern without interference. The Hopfield bound 0.14N0.14N follows from the rank of the Hebbian weight matrix W=N1μ𝝃μ(𝝃μ)W=N^{-1}\sum_{\mu}\boldsymbol{\xi}^{\mu}(\boldsymbol{\xi}^{\mu})^{\top}, which is limited by the number of stored patterns relative to NN. PAL has no weight matrix — its capacity is limited by stack depth, not by the dimension of the weight space. ∎

Remark 9.6 (Retrieval mechanism).

To retrieve pattern 𝛏ts\boldsymbol{\xi}^{t_{s}} from the PAL stack, a query signal q=𝛏ts2q=\|\boldsymbol{\xi}^{t_{s}}\|_{2} is presented. The relay γ^Ms+εMsε[q]\hat{\gamma}_{M_{s}+\varepsilon M_{s}-\varepsilon}[q] activates uniquely for the pattern at stack position ss, and the associated value μMs,ms𝛏ts\mu_{M_{s},m_{s}}\cdot\boldsymbol{\xi}^{t_{s}} is returned. Retrieval time is O(logk)O(\log k) by binary search on the ordered stack.

9.3.3  Sequential Belief Propagation in Markov Random Fields

Markov Random Fields (MRF) with pairwise binary interactions are equivalent to Ising models with site-dependent external fields. Standard belief propagation (BP) on MRFs is a static algorithm requiring the full graph to be available upfront.

Proposition 9.7 (PAL as exact causal BP on tree-structured MRF).

Let 𝒢=(V,E)\mathcal{G}=(V,E) be a tree-structured MRF with binary variables σv{1,+1}\sigma_{v}\in\{-1,+1\} and pairwise potentials ψuv(σu,σv)=eJuvσuσv\psi_{uv}(\sigma_{u},\sigma_{v})=e^{J_{uv}\sigma_{u}\sigma_{v}}. Observations arrive sequentially in leaf-to-root order. Define the signal ut=tanh1(mt)u_{t}=\tanh^{-1}(m_{t}) where mt=𝔼[σtx1:t]m_{t}=\mathbb{E}[\sigma_{t}\mid x_{1:t}] is the marginal belief at step tt under belief propagation. Then the PAL operator with measure:

μ(α,β)=12log1+tanh(α)1tanh(α)𝟏[α=β]\mu^{*}(\alpha,\beta)=\tfrac{1}{2}\log\tfrac{1+\tanh(\alpha)}{1-\tanh(\alpha)}\cdot\mathbf{1}[\alpha=-\beta] (24)

satisfies PALμ(u0:n)=mn\operatorname{PAL}_{\mu^{*}}(u_{0:n})=m_{n} exactly for all tree-structured MRFs, where mnm_{n} is the exact marginal belief at the root. For loopy graphs, the PAL output approximates the loopy BP fixed point.

Proof.

On a tree, belief propagation computes exact marginals by the Bethe-Peierls equations (Mézard and Montanari, 2009). In leaf-to-root order, each message depends only on previously received messages, making the computation causal. The effective field at the root satisfies heff=uvtanh1(muv)h^{\mathrm{eff}}=\sum_{u\in\partial v}\tanh^{-1}(m_{u\to v}), which is the fixed-point of the Preisach self-consistency equation m=tanh(Jm+H)m=\tanh(Jm+H) from proposition˜9.1. With measure μ\mu^{*} from (24), PALμ(u0:n)\operatorname{PAL}_{\mu^{*}}(u_{0:n}) tracks heffh^{\mathrm{eff}} causally, and tanh(PALμ(u0:n))=mn\tanh(\operatorname{PAL}_{\mu^{*}}(u_{0:n}))=m_{n} exactly on trees by induction on tree depth. ∎

9.4  Avalanches and Phase Transitions in PAL

A striking property of the RFIM is its disorder-induced phase transition: at a critical disorder strength Δc\Delta_{c}, the hysteresis loop develops a discontinuity corresponding to a macroscopic avalanche of spin flips (Sethna et al., 2006). At criticality, avalanche sizes follow a power law P(s)sτP(s)\sim s^{-\tau} with universal exponent τ\tau.

Proposition 9.8 (Scalar PAL criticality).

A scalar PAL with measure μ\mu drawn from a Gaussian distribution with variance Δ2\Delta^{2} over the two-dimensional Preisach half-plane 𝒯\mathcal{T} exhibits a phase transition at critical variance Δc\Delta_{c}:

  • For Δ>Δc\Delta>\Delta_{c}: the output PALμ(u0:n)\operatorname{PAL}_{\mu}(u_{0:n}) varies smoothly with unu_{n}subcritical regime, corresponding to gradual relay activations.

  • For Δ<Δc\Delta<\Delta_{c}: a macroscopic fraction of relays activate simultaneously at a critical threshold — supercritical regime, corresponding to an infinite avalanche.

  • At Δ=Δc\Delta=\Delta_{c}: relay activations follow a power law in group size, with the same universality class as the mean-field RFIM.

Proof.

By proposition˜9.1, PAL with Gaussian measure is equivalent to mean-field RFIM with disorder Δ\Delta. The phase transition of the RFIM at Δc=J\Delta_{c}=J (Dahmen and Sethna, 1993) translates directly to PAL. At Δc\Delta_{c}, the self-consistency equation m=𝒫(h)𝟏[h>JmH]𝑑hm=\int\mathcal{P}(h)\mathbf{1}[h>-Jm-H]\,dh has a bifurcation point, corresponding to a jump discontinuity in the Preisach output. ∎

Remark 9.9 (Implications for PAL learning).

proposition˜9.8 has a practical implication: if the learned measure μ\mu^{*} concentrates near the critical disorder Δc\Delta_{c}, the PAL layer operates near a phase transition — maximising sensitivity to input changes while maintaining structured memory. This is analogous to the edge of chaos hypothesis in recurrent neural networks (Langton, 1990), but with a precise physical characterisation through RFIM criticality. Training PAL near Δc\Delta_{c} may be a principled alternative to spectral radius regularisation of recurrent weights.

9.5  Summary: PAL vs. Ising-based Methods

Criterion Ising / RFIM PAL Advantage
Data model Static graph, O(n2)O(n^{2}) couplings Sequential stream PAL
Update cost O(n2)O(n^{2}) per new node O(logk)O(\log k) per new token PAL
Forgetting None (quenched disorder) Significance-based wiping PAL
Exact optimisation NP-hard in general Approximation only Ising
Equilibrium guarantees Yes (Gibbs measure) No (causal only) Ising
Universality class Known (τ,ν,\tau,\nu,\ldots) Inherited from RFIM Equal
Coupling structure Arbitrary JijJ_{ij} Mean-field implicit Ising
Sequence modelling Not native Native (rate-independent) PAL

10  Conclusion

We introduced the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in classical hysteresis theory. The results establish:

  1. 1.

    PAL-Transformer is Turing-complete at depth O(1)O(1), improving on the O(logn)O(\log n) depth of standard transformers (theorem˜4.6). Moreover, a vector PAL (vPAL) with two-dimensional signal projection achieves Turing completeness with a single head (H=1H=1), establishing that signal dimension and head count are exchangeable resources in PAL (corollaries˜4.8 and 4.11).

  2. 2.

    The function classes of PAL and transformer are incomparable, with rate-independence as the separating property (theorems˜5.1 and 5.7).

  3. 3.

    PAL corresponds exactly to Extremum First-Order Logic (EFO), a strict fragment of the FO\mathrm{FO} + Aggregate class corresponding to transformers (theorem˜6.2).

  4. 4.

    PAL is the learned, sequential, causal generalisation of the mean-field Random-Field Ising Model at T=0T=0, inheriting its universality class and phase-transition structure (propositions˜9.1 and 9.8).

PAL is thus a natural architecture for tasks with long episodic memory, where the significance of past events matters more than their recency or position, and where the problem structure is naturally binary and threshold-driven.

Open questions.
  1. 1.

    Empirical validation: Do the predicted task advantages materialise in practice? Experiments on state-tracking benchmarks (MQAR, SCROLLS) would test theorem˜5.1 empirically.

  2. 2.

    Learning dynamics and criticality: Does gradient-based learning of μ(α,β)\mu(\alpha,\beta) drive the measure toward the critical disorder Δc\Delta_{c}? Is training near criticality beneficial empirically, analogous to the edge-of-chaos effect?

  3. 3.

    Beyond mean-field: Can the equivalence in proposition˜9.1 be extended beyond mean-field RFIM to short-range interactions (Bethe lattice, finite-dimensional RFIM)? This would connect PAL to a richer universality class.

  4. 4.

    Hybrid architectures: Can PAL heads be combined with standard attention heads to obtain both random access and significance-based memory?

  5. 5.

    Continuous-time extension: The rate-independence of PAL suggests a natural extension to continuous-time sequence models (neural ODEs, S4), where the driving signal is a continuous path rather than a discrete sequence.

  6. 6.

    Vector PAL and multi-dimensional inputs: corollary˜4.8 establishes that vPAL with a two-dimensional signal 𝐮n2\mathbf{u}_{n}\in\mathbb{R}^{2} is Turing-complete at H=1H=1 head — reducing the head count from H=4H=4 (scalar PAL) to H=1H=1 by exploiting the two independent extremum stacks of the vector signal. This follows the superposition of Preisach half-planes developed by Frydrych (2019) for two-axis fluxgate sensors (Chapters 4.2.2–4.2.6), where the full magnetisation vector 𝐌\mathbf{M} is integrated over the three-dimensional space (α,β,θ)(\alpha,\beta,\theta) with displacement γαβ(θ)\gamma_{\alpha\beta}(\theta) and rotation χαβ(θ)\chi_{\alpha\beta}(\theta) components. Three questions remain open for vPAL: (a) Does the vector RFIM connection extend to the anisotropic case studied by Frydrych, where the measure μ(α,β,θ)\mu(\alpha,\beta,\theta) breaks rotational symmetry? (b) Can the domain-rotation component χαβ(θ)\chi_{\alpha\beta}(\theta) be interpreted as a differentiable residual that complements the binary relay — analogous to soft attention complementing hard attention? (c) Does the exponential growth of the measure parameter space (L2HL^{2H} for HH-dimensional vPAL) create a fundamental expressiveness–efficiency tradeoff absent in scalar PAL?

Broader Impact Statement

This work introduces a theoretical architecture for sequence modelling grounded in classical hysteresis theory. The primary contributions are mathematical — Turing completeness proofs, expressiveness separations, and logical characterisations — and do not directly enable any specific application.

The connection to the Random-Field Ising Model (Section 8) suggests potential applications in combinatorial optimisation, associative memory, and belief propagation. These are established areas of machine learning with broad beneficial applications. We are not aware of direct pathways from this theoretical work to harmful applications.

If implemented in practice, PAL-based models would share the general risks of machine learning systems: potential for bias amplification, misuse in surveillance or manipulation, and environmental cost of training large models. These risks are not specific to PAL and are addressed by general ML ethics guidelines.

The O(nlogn)O(n\log n) computational complexity of PAL (versus O(n2)O(n^{2}) for attention) may reduce the energy cost of training long-context models, which is a potential positive environmental impact.

References

  • E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou (2023) What learning algorithm is in-context learning? investigations with linear models. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • D. J. Amit, H. Gutfreund, and H. Sompolinsky (1985) Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters 55 (14), pp. 1530–1533. External Links: Document Cited by: item 3, §9.3.2.
  • P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva (2020) The logical expressiveness of graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: Corollary A.9, item 4, §1, §6, §8.
  • R. Barroso, A. Egaña, J. L. Marqués, I. Etxeberria-Otadui, and O. Curea (2015) Preisach modeling of LiFePO4 lithium-iron-phosphate battery hysteresis. Journal of Energy Storage 2, pp. 65–72. External Links: Document Cited by: §8.
  • A. Behrouz, P. Zhong, and V. Mirrokni (2025) Titans: learning to memorize at test time. In arXiv preprint arXiv:2501.00663, External Links: Link Cited by: §8.
  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. In arXiv preprint arXiv:2004.05150, External Links: Link Cited by: §8.
  • M. Brokate and J. Sprekels (1996) Hysteresis and phase transitions. Applied Mathematical Sciences, Vol. 121, Springer, New York. External Links: Document Cited by: Proposition 2.3.
  • G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. External Links: Document Cited by: §A.2, §B.2, item 2.
  • K. Dahmen and J. P. Sethna (1993) Hysteresis loop critical exponents in 6ε6-\varepsilon dimensions. Physical Review Letters 71 (20), pp. 3222–3225. External Links: Document Cited by: §9.1, §9.4.
  • P. Frydrych and R. Szewczyk (2014) New portfolio risk optimisation method for strongly dependent assets. Journal of Engineering Studies and Research 20 (3), pp. 30–37. Cited by: §1, item 4, §8.
  • P. Frydrych (2019) Modelowanie charakterystyk magnesowania amorficznych rdzeni dwuosiowych sensorów transduktorowych. Ph.D. Thesis, Politechnika Warszawska, Wydział Mechatroniki, Warsaw, Poland. Cited by: §1, item 6, Definition 3.2, §4.2, §8.
  • M. Furst, J. B. Saxe, and M. Sipser (1984) Parity, circuits, and the polynomial-time hierarchy. In Mathematical Systems Theory, Vol. 17, pp. 13–27. External Links: Document Cited by: Corollary 4.7.
  • A. Gu and T. Dao (2023) Mamba: linear-time sequence modeling with selective state spaces. In arXiv preprint arXiv:2312.00752, External Links: Link Cited by: §1, §8.
  • M. Hahn (2020) Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics 8, pp. 156–171. External Links: Document Cited by: §1, §5.1, §8.
  • J. Håstad (1987) Computational limitations of small-depth circuits. MIT Press, Cambridge, MA. Cited by: §5.1.
  • J. E. Hopcroft and J. D. Ullman (1979) Introduction to automata theory, languages, and computation. Addison-Wesley, Reading, MA. Cited by: §4.2, §4.
  • J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (8), pp. 2554–2558. External Links: Document Cited by: §9.3.2.
  • C. G. Langton (1990) Computation at the edge of chaos: phase transitions and emergent computation. Physica D: Nonlinear Phenomena 42 (1–3), pp. 12–37. External Links: Document Cited by: Remark 9.9.
  • A. Lucas (2014) Ising formulations of many NP problems. Frontiers in Physics 2, pp. 5. External Links: Document Cited by: §9.3.1.
  • I. D. Mayergoyz (1991) Mathematical models of hysteresis. Springer, New York. External Links: Document Cited by: item 4, Proposition 2.4, Proposition 2.5.
  • W. Merrill and A. Sabharwal (2023) The parallelism tradeoff: limitations of log-precision transformers. Transactions of the Association for Computational Linguistics 11, pp. 531–545. External Links: Document Cited by: §8.
  • M. Mézard and A. Montanari (2009) Information, physics, and computation. Oxford University Press, Oxford. External Links: Document Cited by: §9.3.3.
  • B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, et al. (2023) RWKV: reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077. External Links: Document Cited by: §8.
  • J. Pérez, J. Marinković, and P. Barceló (2021) Attention is Turing complete. Journal of Machine Learning Research 22 (75), pp. 1–35. External Links: Link Cited by: item 2, §1, §2.3, §4.2, §8.
  • F. Preisach (1935) Über die magnetische Nachwirkung. Zeitschrift für Physik 94 (5–6), pp. 277–302. External Links: Document Cited by: §1, §2.1.
  • J. P. Sethna, K. A. Dahmen, and O. Perkovic (2006) Random-field ising models of hysteresis. The Science of Hysteresis 2, pp. 107–179. External Links: Link Cited by: §9.1, §9.4, Remark 9.2.
  • N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • H. T. Siegelmann and E. D. Sontag (1995) On the computational power of neural nets. Journal of Computer and System Sciences 50 (1), pp. 132–150. External Links: Document Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: §1.
  • J. von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov (2023) Transformers learn in-context by gradient descent. In International Conference on Machine Learning, Vol. 202, pp. 35151–35174. Cited by: §1.
  • M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, et al. (2020) Big Bird: transformers for longer sequences. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17283–17297. Cited by: Appendix C, §8.

Appendix A Full Proof of Theorem 6.2: PAL corresponds to EFO

We prove Theorem 6.2 by structural induction on EFO formula complexity, establishing a constructive correspondence between EFO formulae and PAL-Transformer computations.

A.1  Formal Setup

We work over sequences 𝐱=(x0,x1,,xn)\mathbf{x}=(x_{0},x_{1},\ldots,x_{n}) where each xidx_{i}\in\mathbb{R}^{d}. Let (𝐱)={i:xi is a local extremum of xi2}\mathcal{E}(\mathbf{x})=\{i:x_{i}\text{ is a local extremum of }\|x_{i}\|_{2}\} denote the set of extremal positions of 𝐱\mathbf{x}.

Definition A.1 (EFO syntax).

The grammar of Extremum First-Order Logic (EFO) over sequences is:

ϕ\displaystyle\phi ::=uicuici<extjϕψϕψ¬ϕ\displaystyle::=\top\mid\bot\mid u_{i}\leq c\mid u_{i}\geq c\mid i<_{\mathrm{ext}}j\mid\phi\wedge\psi\mid\phi\vee\psi\mid\neg\phi
extiϕ(i)extiϕ(i)ExtAggi[f(i),ϕ(i)]\displaystyle\quad\mid\exists^{\mathrm{ext}}i\,\phi(i)\mid\forall^{\mathrm{ext}}i\,\phi(i)\mid\mathrm{ExtAgg}_{i}[f(i),\phi(i)]

where cc\in\mathbb{R} is a constant, ii ranges over extremal positions only, i<extji<_{\mathrm{ext}}j denotes that extremal position ii occurred before extremal position jj in the sequence, and ExtAggi[f(i),ϕ(i)]\mathrm{ExtAgg}_{i}[f(i),\phi(i)] denotes:

ExtAggi[f(i),ϕ(i)]=i(𝐱),ϕ(i) holdsf(xi),\mathrm{ExtAgg}_{i}[f(i),\phi(i)]\;=\;\sum_{i\in\mathcal{E}(\mathbf{x}),\;\phi(i)\text{ holds}}f(x_{i}), (25)

where f:df:\mathbb{R}^{d}\to\mathbb{R} is a measurable function. The ordering relation <ext<_{\mathrm{ext}} is needed to express the causal order of relay state changes, since the relay state at time nn depends on which of the activation and deactivation thresholds was crossed most recently.

Definition A.2 (EFO semantics).

A sequence 𝐱\mathbf{x} satisfies ϕ\phi (written 𝐱ϕ\mathbf{x}\models\phi) according to the standard first-order semantics restricted to extremal positions:

  • 𝐱uic\mathbf{x}\models u_{i}\geq c iff xicx_{i}\geq c for the current assignment of ii.

  • 𝐱extiϕ(i)\mathbf{x}\models\exists^{\mathrm{ext}}i\,\phi(i) iff there exists i(𝐱)i\in\mathcal{E}(\mathbf{x}) such that 𝐱ϕ(i)\mathbf{x}\models\phi(i).

  • Boolean connectives have their standard meaning.

  • ExtAggi[f(i),ϕ(i)]\mathrm{ExtAgg}_{i}[f(i),\phi(i)] sums f(xi)f(x_{i}) over extremal positions satisfying ϕ\phi.

A.2  The Correspondence

We establish a bijection between EFO formulae and PAL computations by defining a compilation map :EFOPAL-Transformer\llbracket\cdot\rrbracket:\mathrm{EFO}\to\mathrm{PAL\text{-}Transformer}.

Lemma A.3 (Extremal position indicator).

The indicator 𝟏[i(𝐱)]\mathbf{1}[i\in\mathcal{E}(\mathbf{x})] is computable by a single-head PAL as:

𝟏[i(𝐱)]=γ^αi+εαiε[u](i)γ^αi+εαiε[u](i1),\mathbf{1}[i\in\mathcal{E}(\mathbf{x})]=\hat{\gamma}_{\alpha_{i}+\varepsilon\alpha_{i}-\varepsilon}[u](i)-\hat{\gamma}_{\alpha_{i}+\varepsilon\alpha_{i}-\varepsilon}[u](i-1), (26)

where uj=xj2u_{j}=\|x_{j}\|_{2} is the scalar projection of the sequence (one specific scalarisation; any injective ϕ:d\phi:\mathbb{R}^{d}\to\mathbb{R} suffices for the lemma to hold). That is, the relay changes state at step ii if and only if uiu_{i} is a local extremum.

Proof.

By Definition 2.1, the relay γ^αi+εαiε[u](i)\hat{\gamma}_{\alpha_{i}+\varepsilon\alpha_{i}-\varepsilon}[u](i) changes from 0 to 1 at step ii iff uiαi+εu_{i}\geq\alpha_{i}+\varepsilon, and from 1 to 0 iff uiαiεu_{i}\leq\alpha_{i}-\varepsilon. In both cases, a state change occurs exactly when uiu_{i} crosses one of the thresholds, which — by choice of αi=ui\alpha_{i}=u_{i} and ε0\varepsilon\to 0 — happens exactly at local extrema. The difference (26) detects this state change. ∎

Lemma A.4 (Threshold comparison).

The formula uicu_{i}\geq c at an extremal position ii is computable by a PAL with thresholds α=c+ε\alpha=c+\varepsilon, β=cε\beta=c-\varepsilon as:

𝟏[uic]=γ^c+εcε[u](i).\mathbf{1}[u_{i}\geq c]\;=\;\hat{\gamma}_{c+\varepsilon c-\varepsilon}[u](i). (27)
Proof.

Direct from Definition 2.1: when the input uiu_{i} crosses threshold α=c+ε\alpha=c+\varepsilon, the relay switches to 1; when it falls below β=cε\beta=c-\varepsilon, it switches to 0. In the limit ε0\varepsilon\to 0, this detects uicu_{i}\geq c at each extremal step. ∎

Lemma A.5 (Boolean combinations via MLP).

Let ϕ1,ϕ2\phi_{1},\phi_{2} be two EFO formulae computable by PAL heads producing outputs r1,r2{0,1}r_{1},r_{2}\in\{0,1\}. Then ϕ1ϕ2\phi_{1}\wedge\phi_{2}, ϕ1ϕ2\phi_{1}\vee\phi_{2}, and ¬ϕ1\neg\phi_{1} are computable by a two-layer MLP applied to (r1,r2)(r_{1},r_{2}).

Proof.

Boolean functions on {0,1}2\{0,1\}^{2} are representable by two-layer networks with ReLU activations (Cybenko 1989):

r1r2\displaystyle r_{1}\wedge r_{2} =ReLU(r1+r21),\displaystyle=\mathrm{ReLU}(r_{1}+r_{2}-1),
r1r2\displaystyle r_{1}\vee r_{2} =ReLU(r1+r2)ReLU(r1+r21),\displaystyle=\mathrm{ReLU}(r_{1}+r_{2})-\mathrm{ReLU}(r_{1}+r_{2}-1),
¬r1\displaystyle\neg r_{1} =1r1.\displaystyle=1-r_{1}.

These are exact (not approximate) representations on {0,1}\{0,1\}. ∎

Lemma A.6 (Extremum aggregation).

The formula ExtAggi[f(xi),ϕ(i)]\mathrm{ExtAgg}_{i}[f(x_{i}),\phi(i)] is computable by a single MPAL head with measure:

μ(α,β)=f(Dec(α,β))𝟏[ϕ holds at (α,β)],\mu_{(\alpha,\beta)}=f(\operatorname{Dec}(\alpha,\beta))\cdot\mathbf{1}[\phi\text{ holds at }(\alpha,\beta)], (28)

where Dec(α,β)\operatorname{Dec}(\alpha,\beta) recovers the value xix_{i} encoded at thresholds (α,β)(\alpha,\beta).

Proof.

The MPAL output for a single head with measure μ\mu is:

PALμ(u0:n)=ijμijγ^αiβj[u](n).\operatorname{PAL}_{\mu}(u_{0:n})=\sum_{i\geq j}\mu_{ij}\cdot\hat{\gamma}_{\alpha_{i}\beta_{j}}[u](n).

At time nn, γ^αiβj[u](n)=1\hat{\gamma}_{\alpha_{i}\beta_{j}}[u](n)=1 if and only if the current extremum stack Πn\Pi_{n} contains a pair (M,m)(M,m) with M(αiε,αi+ε)M\in(\alpha_{i}-\varepsilon,\alpha_{i}+\varepsilon) and m(βjε,βj+ε)m\in(\beta_{j}-\varepsilon,\beta_{j}+\varepsilon), i.e. the relay encodes an active extremal position satisfying the threshold condition.

Setting μij=f(Dec(αi,βj))𝟏[ϕ\mu_{ij}=f(\operatorname{Dec}(\alpha_{i},\beta_{j}))\cdot\mathbf{1}[\phi holds at (i,j)](i,j)] makes the PAL output equal to (25), since the relay sums over exactly those extremal positions where ϕ\phi holds, weighted by ff. ∎

Lemma A.7 (Existential extremal quantification).

The formula extiϕ(i)\exists^{\mathrm{ext}}i\,\phi(i) is computable by a PAL followed by a threshold MLP:

𝐱extiϕ(i)ReLU(ExtAggi[𝟏,ϕ(i)])>0.\mathbf{x}\models\exists^{\mathrm{ext}}i\,\phi(i)\;\iff\;\mathrm{ReLU}\!\left(\mathrm{ExtAgg}_{i}[\mathbf{1},\phi(i)]\right)>0. (29)
Proof.

ExtAggi[𝟏,ϕ(i)]\mathrm{ExtAgg}_{i}[\mathbf{1},\phi(i)] counts the number of extremal positions satisfying ϕ\phi. This is positive iff at least one such position exists, which is exactly the semantics of exti\exists^{\mathrm{ext}}i. The threshold >0>0 is implemented by a ReLU followed by a Heaviside (approximated to arbitrary precision by a steep sigmoid, exact under unit-cost arithmetic). ∎

A.3  Inductive Proof of Theorem 6.2

Proof of Theorem 6.2.

We prove both directions by structural induction on EFO formula complexity.

(\Rightarrow) Every PAL-computable function is EFO-definable.

We show that every primitive PAL operation corresponds to an EFO formula.

Base case — single relay: We show γ^αβ[u](n)\hat{\gamma}_{\alpha\beta}[u](n) is EFO-definable. By the wiping property (proposition˜2.4) and stack sufficiency (proposition˜2.7), the relay state at time nn is determined entirely by the most recent state-changing extremum. Specifically, γ^αβ[u](n)=1\hat{\gamma}_{\alpha\beta}[u](n)=1 iff there exists an extremal position tnt^{*}\leq n at which the relay was activated (utαu_{t^{*}}\geq\alpha), and no subsequent extremal position t(t,n]t\in(t^{*},n] deactivated it (utβu_{t}\leq\beta). This is expressed in EFO with ordering as:

γ^αβ[u](n)=1exttn[utαextt:t<exttut>β].\hat{\gamma}_{\alpha\beta}[u](n)=1\;\iff\;\exists^{\mathrm{ext}}t^{*}\leq n\;\Bigl[u_{t^{*}}\geq\alpha\;\wedge\;\forall^{\mathrm{ext}}t:t^{*}<_{\mathrm{ext}}t\;\Rightarrow\;u_{t}>\beta\Bigr]. (30)

The existential quantifier ranges over extremal positions by definition˜A.1; the comparison utαu_{t^{*}}\geq\alpha is atomic; the ordering t<exttt^{*}<_{\mathrm{ext}}t uses the causal order in EFO; and ut>βu_{t}>\beta is atomic. Hence (30) is an EFO formula.

Inductive case — weighted sum: PALμ(u0:n)=ijμijγ^αiβj[u](n)\operatorname{PAL}_{\mu}(u_{0:n})=\sum_{i\geq j}\mu_{ij}\hat{\gamma}_{\alpha_{i}\beta_{j}}[u](n) is a weighted sum of relay states. By the inductive hypothesis each γ^αiβj\hat{\gamma}_{\alpha_{i}\beta_{j}} is EFO-definable; their weighted sum is ExtAggi[μγ^,]\mathrm{ExtAgg}_{i}[\mu\cdot\hat{\gamma}_{\cdot\cdot},\top], which is EFO by Definition A.1.

MLP layers: MLP computes Boolean combinations (Lemma A.5) and threshold functions of PAL outputs, all of which are EFO-definable by induction.

Multi-head composition: The MPAL output is a sum of single-head PAL outputs; EFO is closed under addition (as a special case of ExtAgg\mathrm{ExtAgg}).

By induction, the entire PAL-Transformer output is EFO-definable.

(\Leftarrow) Every EFO formula is PAL-computable.

We show that each EFO construct is implementable by a PAL component.

Atomic formula uicu_{i}\geq c: Implemented by Lemma A.4 with α=c+ε\alpha=c+\varepsilon, β=cε\beta=c-\varepsilon.

Boolean combinations: Implemented by Lemma A.5.

Existential quantification extiϕ(i)\exists^{\mathrm{ext}}i\,\phi(i): Implemented by Lemma A.7.

Universal quantification extiϕ(i)\forall^{\mathrm{ext}}i\,\phi(i): Equivalent to ¬(exti¬ϕ(i))\neg(\exists^{\mathrm{ext}}i\,\neg\phi(i)), implementable by Lemmas A.5 and A.7.

Extremum aggregation ExtAggi[f(i),ϕ(i)]\mathrm{ExtAgg}_{i}[f(i),\phi(i)]: Implemented by Lemma A.6 with measure μij=f(Dec(αi,βj))𝟏[ϕ\mu_{ij}=f(\operatorname{Dec}(\alpha_{i},\beta_{j}))\cdot\mathbf{1}[\phi holds at (i,j)](i,j)].

Nested formulae: Suppose ϕ=ExtAggi[f(i),ψ(i)]\phi=\mathrm{ExtAgg}_{i}[f(i),\psi(i)] where ψ\psi is an EFO formula. By the inductive hypothesis, ψ\psi is implementable by a PAL sub-computation producing output ri{0,1}r_{i}\in\{0,1\} for each extremal position ii. Then ExtAggi[f(i),ψ(i)]\mathrm{ExtAgg}_{i}[f(i),\psi(i)] is implemented by a PAL head with measure μij=f(Dec(αi,βj))rij\mu_{ij}=f(\operatorname{Dec}(\alpha_{i},\beta_{j}))\cdot r_{ij}, where rijr_{ij} is the output of the sub-computation. This composes PAL layers — at most one additional layer per level of nesting.

By induction on formula depth, every EFO formula is implementable by a PAL-Transformer with depth proportional to the nesting depth of ExtAgg\mathrm{ExtAgg} operators. ∎

A.4  Corollaries of the EFO Correspondence

Corollary A.8 (Decidability of PAL expressiveness).

Given a function f:n+1f:\mathbb{R}^{n+1}\to\mathbb{R}, deciding whether fPALf\in\mathcal{F}_{\operatorname{PAL}} reduces to deciding whether ff is EFO-definable, which is decidable for functions over finite alphabets by standard model-theoretic methods.

Corollary A.9 (EFO \subsetneq FO + Aggregate).

EFO\mathrm{EFO} is strictly contained in FO+Aggregate\mathrm{FO}+\mathrm{Aggregate} (the logic corresponding to soft-attention transformers (Barceló et al., 2020)):

EFOFO+Aggregate.\mathrm{EFO}\subsetneq\mathrm{FO}+\mathrm{Aggregate}.

The strict inclusion is witnessed by fcopy(x0:n,p)=xpf_{\mathrm{copy}}(x_{0:n},p)=x_{p}: definable in FO\mathrm{FO} with positional quantification i(i=p)\exists i\,(i=p), but not in EFO\mathrm{EFO} (which cannot quantify over non-extremal positions). The reverse inclusion fails because frangef_{\mathrm{range}} is definable in EFO\mathrm{EFO} but not in bounded-depth FO+Aggregate\mathrm{FO}+\mathrm{Aggregate} (Proposition 5.3).

Corollary A.10 (PAL depth vs. formula nesting).

A kk-layer PAL-Transformer implements EFO formulae of nesting depth at most kk. Conversely, an EFO formula of nesting depth kk requires at most kk PAL layers. This gives a precise depth-expressiveness tradeoff: depth kk PAL \equiv EFO-depth kk.

Appendix B Auxiliary Lemmas for Section 4

B.1  Correctness of Stack Encoding under Repeated Symbols

Here we verify that the Cantor-depth encoding (Definition 4.1) handles stacks with repeated symbols correctly across all three operations.

Lemma B.1 (POP removes exactly one element).

Under the Cantor-depth encoding, the POP signal un+1=Mtop1+2εu_{n+1}=M_{\mathrm{top}-1}+2\varepsilon removes exactly the topmost pair (Mtop,mtop)(M_{\mathrm{top}},m_{\mathrm{top}}) and no other.

Proof.

The wiping property removes all pairs (Mi,mi)(M_{i},m_{i}) with Mi<un+1=Mtop1+2εM_{i}<u_{n+1}=M_{\mathrm{top}-1}+2\varepsilon. Since Mtop<Mtop1M_{\mathrm{top}}<M_{\mathrm{top}-1} (strict monotonicity), we have Mtop<Mtop1+2εM_{\mathrm{top}}<M_{\mathrm{top}-1}+2\varepsilon, so the pair (Mtop,mtop)(M_{\mathrm{top}},m_{\mathrm{top}}) is wiped. Since Mtop1<Mtop1+2εM_{\mathrm{top}-1}<M_{\mathrm{top}-1}+2\varepsilon implies Mtop1M_{\mathrm{top}-1} is not less than un+1u_{n+1}, the pair (Mtop1,mtop1)(M_{\mathrm{top}-1},m_{\mathrm{top}-1}) is preserved. (More precisely, wiping removes Mi<un+1M_{i}<u_{n+1}; Mtop1<Mtop1+2ε=un+1M_{\mathrm{top}-1}<M_{\mathrm{top}-1}+2\varepsilon=u_{n+1} is vacuously false since we check strict inequality.) Hence exactly one pair is removed. ∎

Lemma B.2 (PUSH does not trigger wiping).

Under the Cantor-depth encoding, the PUSH signal un+1=code(ai,d+1)+εu_{n+1}=\mathrm{code}^{*}(a_{i},d+1)+\varepsilon does not trigger the wiping property.

Proof.

By Lemma 4.2, code(ai,d+1)<code(aj,d)\mathrm{code}^{*}(a_{i},d+1)<\mathrm{code}^{*}(a_{j},d) for all j{1,,k}j\in\{1,\ldots,k\}. Therefore un+1=code(ai,d+1)+ε<Mtopu_{n+1}=\mathrm{code}^{*}(a_{i},d+1)+\varepsilon<M_{\mathrm{top}} (since Mtop=code(aj,d)+εM_{\mathrm{top}}=\mathrm{code}^{*}(a_{j},d)+\varepsilon for some jj and the ε\varepsilon gap is smaller than the inter-level gap Δ>0\Delta>0, provided ε<Δ/2\varepsilon<\Delta/2). Since un+1<Mtopu_{n+1}<M_{\mathrm{top}}, the wiping condition un+1>Mtopu_{n+1}>M_{\mathrm{top}} is not satisfied. A new pair is added without removing any existing pairs. ∎

B.2  MLP Width for Transition Function

Lemma B.3 (MLP width for finite transition functions).

For a 2-PDA with |Q||Q| states, input alphabet |Σ||\Sigma|, and stack alphabet |Γ||\Gamma|, the transition function δ:Q×Σ×Γ×ΓQ×Γ×Γ\delta:Q\times\Sigma\times\Gamma\times\Gamma\to Q\times\Gamma^{*}\times\Gamma^{*} is implementable by a two-layer MLP of width O(|Q||Σ||Γ|2)O(|Q|\cdot|\Sigma|\cdot|\Gamma|^{2}).

Proof.

δ\delta has domain of size |Q||Σ||Γ|2|Q|\cdot|\Sigma|\cdot|\Gamma|^{2} and is a finite lookup table. By the universal approximation theorem (Cybenko, 1989), any function on a finite domain of size NN is exactly representable by a two-layer network with NN hidden units through a table-lookup construction: for each input configuration (q,a,s1,s2)(q,a,s_{1},s_{2}), one hidden neuron fires for that configuration (indicator neuron) and outputs the corresponding transition values. Width: one neuron per table entry =O(|Q||Σ||Γ|2)=O(|Q|\cdot|\Sigma|\cdot|\Gamma|^{2}). ∎

Appendix C Relation to Sparse Attention

PAL can be viewed as a form of content-adaptive sparse attention where the sparsity pattern is determined by the sequence of local extrema rather than by position (sliding window) or random selection (BigBird (Zaheer et al., 2020)).

Proposition C.1 (PAL as extremal sparse attention).

PAL implements attention over the set of extremal positions:

PALμ(u0:n)=i(u0:n)aivi,\operatorname{PAL}_{\mu}(u_{0:n})=\sum_{i\in\mathcal{E}(u_{0:n})}a_{i}\cdot v_{i}, (31)

where ai=μ(α(ui),β(ui1))a_{i}=\mu_{(\alpha(u_{i}),\beta(u_{i-1}))} is a content-determined weight and vi=γ^α(ui)v_{i}=\hat{\gamma}_{\alpha(u_{i})\cdot} is the relay value. The attention mask is {i:i(u0:n)}\{i:i\in\mathcal{E}(u_{0:n})\}, which is determined by the input content, not position.

Proof.

At time nn, the PAL output sums over relay pairs (i,j)(i,j) where the relay is active. A relay γ^αiβj\hat{\gamma}_{\alpha_{i}\beta_{j}} is active at time nn only if the most recent state change was an activation at some extremal position tnt^{*}\leq n with utαiu_{t^{*}}\geq\alpha_{i}. This is equivalent to attending to extremal positions, with weights given by the measure μ\mu. ∎

This framing positions PAL as a principled alternative to heuristic sparse attention patterns: the sparsity is not imposed externally but emerges naturally from the structure of the Preisach operator.