Preisach Attention: A Hysteretic Model of Sequential Memory
Abstract
We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds $(\alpha, \beta)$, maintaining a stack of local extrema as its internal state. We establish three results. First, a single-layer PAL-Transformer with depth is Turing-complete under arbitrary-precision arithmetic, achievable through simulation of a two-stack pushdown automaton, in contrast to the depth required by standard hard-attention transformers (Pérez et al., 2021). Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in layers that require layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence: PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with total inference cost versus for standard attention.
Keywords: Preisach operator, hysteresis, attention mechanism, Turing completeness, expressiveness, sequence modelling, long-range dependence, rate-independence.
Contents
- 1 Introduction
- 2 Background
- 3 Preisach Attention Layer
- 4 Turing Completeness
- 5 Expressiveness Separation
- 6 Logical Characterisation
- 7 Computational Complexity
- 8 Related Work
- 9 Connection to the Random-Field Ising Model
- 10 Conclusion
- References
- A Full Proof of Theorem 6.2: PAL corresponds to EFO
- B Auxiliary Lemmas for Section 4
- C Relation to Sparse Attention
1 Introduction
The transformer architecture (Vaswani et al., 2017) and its attention mechanism have become the dominant paradigm for sequence modelling. Theoretical analysis of transformer expressiveness has established that hard-attention transformers are Turing-complete (Pérez et al., 2021), that softmax attention corresponds to specific fragments of first-order logic (Hahn, 2020; Barceló et al., 2020), and that practical transformers implement approximate versions of classical algorithms (Akyürek et al., 2023; von Oswald et al., 2023).
A parallel body of work has studied alternative sequence models — recurrent networks (Siegelmann and Sontag, 1995), state space models (Gu and Dao, 2023), and mixture-of-experts architectures (Shazeer et al., 2017) — seeking more efficient representations of long-range dependence. These efforts share a common limitation: their memory mechanisms are either temporal (decaying by distance) or positional (indexed by token location), with no mechanism for value-based memory that persists based on the significance of past inputs rather than their recency.
This paper.
We propose a fundamentally different memory mechanism inspired by the Preisach hysteresis operator (Preisach, 1935), a classical model from mathematical physics used to describe ferromagnetic materials, elastoplastic systems and, more recently, agent-based financial markets (Frydrych and Szewczyk, 2014; Frydrych, 2019). The Preisach operator aggregates the outputs of binary relays $\gamma_{\alpha\beta}$, each characterised by an activation threshold $\alpha$ and a deactivation threshold $\beta$, weighted by a learned measure $\mu$ over the threshold plane.
The key structural properties of the Preisach operator that motivate its use in sequence modelling are:
1. Rate-independence: the output depends only on the sequence of local extrema of the input, not on temporal spacing or absolute position.
2. Wiping property: a new extremum erases all previous extrema of smaller magnitude, providing a natural forgetting mechanism based on significance rather than recency.
3. Extremum stack sufficiency: the stack of alternating local maxima and minima is a minimal sufficient statistic of input history for all rate-independent functionals.
4. Universal approximation: the class of Preisach operators with is dense in the space of all continuous, causal, rate-independent functionals (Mayergoyz, 1991).
Main contributions.
1. We define the Preisach Attention Layer (PAL) and its multi-head variant (MPAL), giving exact computational definitions and connecting them to the classical Preisach operator (Section 3).
2. We prove that a single-layer PAL-Transformer is Turing-complete by simulating a two-stack pushdown automaton (Section 4).
3. We prove a formal expressiveness separation between PAL and transformer attention (Section 5), identifying rate-independence as the separating property.
4. We give a logical characterisation of PAL in terms of an extremum first-order logic (Section 6).
5. We outline computational complexity advantages of PAL and identify task classes where PAL is predicted to outperform standard attention (Section 7).
2 Background
2.1 The Preisach Hysteresis Operator
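The Preisach operator (Preisach, 1935) was introduced to model magnetic hysteresis in ferromagnetic materials. Let $\mathcal{P} = \{(\alpha, \beta) \in \mathbb{R}^2 : \alpha \ge \beta\}$ be the Preisach half-plane.
Definition 2.1 (Elementary relay).
For $(\alpha, \beta) \in \mathcal{P}$ and input sequence $x = (x_1, x_2, \ldots)$, the elementary relay $\gamma_{\alpha\beta}$ is defined by:
$$\gamma_{\alpha\beta}[x]_t = \begin{cases} +1 & \text{if } x_t \ge \alpha, \\ -1 & \text{if } x_t \le \beta, \\ \gamma_{\alpha\beta}[x]_{t-1} & \text{if } \beta < x_t < \alpha. \end{cases} \tag{1}$$
The interval $(\beta, \alpha)$ is the dead band of the relay.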
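The switching rule of a single relay can be sketched in a few lines of Python. This is a minimal illustration, assuming the common convention that the relay outputs +1 once the input reaches the activation threshold, outputs -1 once it falls to the deactivation threshold, and keeps its previous state inside the dead band. The function name `relay`, the parameter names `alpha` and `beta`, and the initial state -1 are assumptions made for the example.

```python
def relay(xs, alpha, beta, state=-1):
    """Elementary hysteresis relay (assumed +1/-1 output convention).

    alpha: activation threshold, beta: deactivation threshold (alpha >= beta).
    state: assumed relay state before the first input arrives.
    """
    if alpha < beta:
        raise ValueError("expected alpha >= beta")
    out = []
    for x in xs:
        if x >= alpha:
            state = 1       # input reached the activation threshold
        elif x <= beta:
            state = -1      # input fell to the deactivation threshold
        # otherwise x lies inside the dead band: keep the previous state
        out.append(state)
    return out


history = relay([0.0, 2.0, 1.0, 0.5, -1.0, 0.5], alpha=1.5, beta=0.0)
```

The input value 0.5 occurs twice but produces different outputs (+1 at index 3, -1 at index 5), because the relay remembers which threshold it crossed last.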
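Definition 2.2 (Preisach operator).
Let $\mu$ be a signed measure on the Preisach half-plane $\mathcal{P}$. The Preisach operator $\Gamma_\mu$ is:
$$\Gamma_\mu[x]_t = \iint_{\mathcal{P}} \gamma_{\alpha\beta}[x]_t \, d\mu(\alpha, \beta), \tag{2}$$
where $\gamma_{\alpha\beta}$ is the elementary relay of Definition 2.1 with activation threshold $\alpha$ and deactivation threshold $\beta$.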
The Preisach operator has three properties central to this paper.
Proposition 2.3 (Rate-independence (Brokate and Sprekels, 1996)).
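For any non-decreasing bijection $\varphi$ of the time axis, the Preisach operator $\Gamma_\mu$ of Definition 2.2 satisfies $\Gamma_\mu[x \circ \varphi] = \Gamma_\mu[x] \circ \varphi$.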
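Rate-independence can be checked numerically on a single relay. The sketch below is illustrative only: it assumes a +1/-1 relay with initial state -1, and it uses the hypothetical helper `stretch`, which holds every sample for `k` steps, as a discrete stand-in for a non-decreasing time rescaling.

```python
def relay(xs, alpha, beta, state=-1):
    """Hysteresis relay with assumed +1/-1 outputs and initial state -1."""
    out = []
    for x in xs:
        if x >= alpha:
            state = 1
        elif x <= beta:
            state = -1
        out.append(state)
    return out


def stretch(xs, k):
    """Discrete time rescaling: hold every sample for k consecutive steps."""
    return [x for x in xs for _ in range(k)]


xs = [0.0, 2.0, 1.0, -1.0, 0.5, 3.0]
fast = relay(xs, alpha=1.5, beta=0.0)
slow = relay(stretch(xs, 4), alpha=1.5, beta=0.0)

# Reading the slow output at the rescaled times recovers the fast output.
same = slow[3::4] == fast
```

Slowing the input down changes when the relay switches, but not the sequence of states it passes through.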
Proposition 2.4 (Wiping property (Mayergoyz, 1991)).
Let be the alternating local maxima and minima of . If , then the relay state induced by is erased — the output after is identical to what it would be had not occurred.
Proposition 2.5 (Universal approximation (Mayergoyz, 1991)).
The set is dense in the space of all continuous, causal, rate-independent functionals on under the uniform topology.
2.2 The Extremum Stack
A fundamental algorithmic consequence of the wiping property is that the Preisach output depends only on the current extremum stack:
Definition 2.6 (Extremum stack).
The extremum stack of is the sequence:
| (3) |
of alternating local maxima and local minima , stored in decreasing order, where the pair records the -th local maximum and its subsequent local minimum.
Proposition 2.7 (Stack sufficiency).
is a measurable function of alone. In particular, two sequences and with identical extremum stacks satisfy for all .
Proposition 2.8 (Stack update complexity).
Algorithm 1 runs in $O(1)$ amortised time per step, since each element is pushed and popped at most once, and requires space where is the current stack depth. The relay state can be read from in time by binary search.
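The push/pop behaviour described above can be illustrated with a short Python sketch. It is one possible extremum-stack update and not necessarily the authors' Algorithm 1. The function name `update_stack`, the flat list layout (maxima at even indices, minima at odd indices, current input value last), and the convention that the input starts below every threshold are all assumptions of this example.

```python
def update_stack(stack, x):
    """Update a flat extremum stack in place with the new sample x.

    Assumed layout: even indices hold maxima, odd indices hold minima,
    and the last entry is always the current input value.
    """
    top_is_max = len(stack) % 2 == 1
    # The input keeps moving in the same direction: replace the provisional top.
    if stack and ((top_is_max and x >= stack[-1]) or
                  (not top_is_max and x <= stack[-1])):
        stack.pop()
    # Wiping: x dominates the previous extremum of its own kind, so the
    # enclosed (maximum, minimum) pair is erased.
    while len(stack) >= 2:
        x_is_max = len(stack) % 2 == 0
        dominated = x >= stack[-2] if x_is_max else x <= stack[-2]
        if not dominated:
            break
        stack.pop()
        stack.pop()
    stack.append(x)
    return stack


stack = []
for sample in [1.0, 3.0, 2.0, 5.0, 4.0, 4.5, 0.0]:
    update_stack(stack, sample)
```

Every sample is appended once and removed at most once, which is where the amortised constant cost per step comes from. In the example the final drop to 0.0 wipes the inner pair (4.0, 4.5) and leaves only the dominant maximum 5.0 and the current minimum.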
2.3 Transformer Attention
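For completeness we recall standard definitions. Given query $q \in \mathbb{R}^{d}$, keys $K \in \mathbb{R}^{n \times d}$ and values $V \in \mathbb{R}^{n \times d_v}$, scaled dot-product attention is:
$$\mathrm{Attn}(q, K, V) = \mathrm{softmax}\!\left(\frac{K q}{\sqrt{d}}\right)^{\top} V. \tag{4}$$
Hard attention replaces softmax with argmax: $\mathrm{HardAttn}(q, K, V) = v_{j^*}$ with $j^* = \arg\max_{j} \langle q, k_j \rangle$, where $k_j$ and $v_j$ are the $j$-th rows of $K$ and $V$.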
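For concreteness, both attention variants can be written in plain Python for a single query. This is a minimal sketch of the standard definitions; the function names `soft_attention` and `hard_attention` and the toy keys and values are chosen for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_attention(q, keys, values):
    """Scaled dot-product attention for one query vector q."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys]
    top = max(scores)                       # shift for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(dim)]


def hard_attention(q, keys, values):
    """Hard attention: return the value whose key scores highest."""
    scores = [dot(q, k) for k in keys]
    return values[scores.index(max(scores))]


keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 10.0, 0.0]                    # aligned with the second key

soft = soft_attention(query, keys, values)
hard = hard_attention(query, keys, values)
```

With a query strongly aligned to one key, the softmax output is already close to the hard-attention output, which selects the second value exactly.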
We use the transformer model of Pérez et al. (2021): an -layer encoder-decoder with hard attention, sinusoidal position encoding, layer normalisation, and residual connections.
3 Preisach Attention Layer
3.1 Definition
Definition 3.1 (Preisach Attention Layer).
Let be a scalar input sequence and a discretised measure on the grid with , . The Preisach Attention Layer is:
| (5) |
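A discretised layer of this kind can be sketched as a weighted sum of relays evaluated causally over the sequence. The code below is an illustration under stated assumptions, not the authors' implementation: relays output +1 or -1, all relays start in state -1, and the measure is given as a flat list of hypothetical `(alpha, beta, weight)` triples rather than as a two-index grid.

```python
def pal_forward(xs, relays, init=-1):
    """Weighted sum of hysteresis relays over a scalar sequence.

    relays: list of (alpha, beta, weight) triples with alpha >= beta,
            one triple per grid point of the discretised measure.
    init:   assumed initial state shared by all relays.
    """
    states = [init] * len(relays)
    outputs = []
    for x in xs:
        for k, (alpha, beta, _) in enumerate(relays):
            if x >= alpha:
                states[k] = 1
            elif x <= beta:
                states[k] = -1
            # inside the dead band the relay keeps its state
        outputs.append(sum(w * s for (_, _, w), s in zip(relays, states)))
    return outputs


relays = [(1.5, 0.0, 1.0), (0.5, -0.5, 2.0)]
ys = pal_forward([0.0, 1.0, 0.2, -0.2, -1.0], relays)
```

This direct evaluation touches every relay at every step; it is meant as a reference against which a stack-based evaluation could be checked, not as an efficient implementation.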
Definition 3.2 (Multi-head PAL).
Let be a vector-valued sequence. Multi-head PAL with heads is:
| (6) |
where projects to a scalar signal and projects back to the model dimension. The scalar projection is a simplifying assumption of the present work; Section 10 discusses the vector extension (vPAL) where and the measure is defined over the three-dimensional space following Frydrych (2019).
Definition 3.3 (PAL-Transformer).
An -layer PAL-Transformer is defined by:
| (7) | ||||
| (8) | ||||
| (9) |
where is sinusoidal position encoding, is layer normalisation, and is a two-layer network with ReLU.
3.2 Connection to Classical Preisach Operator
Proposition 3.4 (PAL as discretised Preisach).
is the discretised Preisach operator where is an atomic measure concentrated on the grid . As and with fixed, uniformly on by Proposition 2.5.
3.3 Relationship to Standard Attention
Proposition 3.5 (Attention as continuous relaxation of PAL).
Standard softmax attention (4) is a continuous relaxation of PAL in which:
1. The binary relay is replaced by the soft weight ,
2. The measure is replaced by the value matrix ,
3. Rate-independence is broken: attention depends on absolute position through , while PAL depends only on the sequence of extrema.
4 Turing Completeness
We prove that a single-layer PAL-Transformer is Turing-complete by showing it can simulate an arbitrary two-stack pushdown automaton (2-PDA), which is known to be equivalent to a Turing machine (Hopcroft and Ullman, 1979).
4.1 Encoding Alphabet Symbols in the Extremum Stack
Definition 4.1 (Cantor-depth encoding).
Let be a stack alphabet, a depth bound, a resolution parameter, and . The Cantor-depth encoding is:
| (10) |
Lemma 4.2 (Strict monotonicity).
For all and : .
Proof.
. ∎
Lemma 4.3 (Stack operations via signal generation).
Under encoding (10), the three stack operations are realised by the following signal emissions, without triggering the wiping property erroneously:
| (11) | ||||
| (12) | ||||
| (13) |
Proof.
PUSH: By Lemma 4.2, for all , so . The wiping condition () is not triggered; a new pair is added to the stack top.
POP: Since (strict monotonicity of stack), triggers wiping, removing . The signal itself satisfies (if depth ), so only one pair is removed.
TOP: where , which is a deterministic function of computable by an MLP. ∎
4.2 Main Theorem
Definition 4.4 (Autoregressive PAL-Transformer).
The Turing-completeness result of Theorem 4.6 is stated for the autoregressive (closed-loop) operational regime of the PAL-Transformer, distinct from the parallel encoder of Definition 3.3. In the autoregressive regime:
1. At step , the PAL-Transformer receives the current input token (which may itself be a function of previous outputs) and produces output .
2. A component of , specifically the signal channels , is appended as the next input token , creating a closed generative loop.
3. The extremum stacks persist across steps as the sole recurrent state; no other hidden state is maintained.
This regime corresponds to standard autoregressive language model inference. The result is a generation theorem (unbounded computation traces can be produced), not merely a verification theorem (pre-encoded traces classified). The parallel encoder definition (Definition 3.3) is used for training and for the expressiveness results of Section 5, where the full input sequence is available.
Lemma 4.5 (Multi-head PAL decodes the extremum stack).
For any extremum stack of depth at most over a discretised alphabet of size , there exist measures with heads such that the map
| (14) |
is injective. In particular, (the top-of-stack element) is a deterministic function of the -dimensional MPAL output, computable by a two-layer MLP.
Proof.
Under the Cantor-depth encoding (Definition 4.1), each stack element at depth has a unique scalar value . The total number of distinct values is , each separated by .
Construct PAL heads with indicator measures: , where maps threshold pair to the corresponding Cantor-depth code, and extracts the -th bit of the binary representation of that code.
Each head outputs , which equals the -th bit of the binary encoding of the top-of-stack Cantor code when the measure is concentrated on the relay active at the stack top.
The -bit vector uniquely identifies via binary decoding, a linear operation implementable by a single MLP layer. The full stack is decoded by repeating this procedure with a mask that deactivates the top relay after reading it (POP operation via the signal generation of Lemma 4.3). ∎
Theorem 4.6 (Turing completeness of PAL-Transformer).
For any two-stack pushdown automaton , there exists a single-layer PAL-Transformer with MPAL heads, one MLP layer, and arithmetic precision bits that simulates step-by-step on inputs of length .
Proof.
We construct with four independent signal channels , each processed by a dedicated MPAL head.
Channel 1 — Machine state.
encodes the current state as a scalar. Updated directly at each step by the MLP after computing .
Channel 2 — Stack 1.
encodes the contents of stack 1 via Lemma 4.3. The extremum stack is a bijective encoding of the stack-1 contents.
Channel 3 — Stack 2.
encodes stack 2 analogously.
Channel 4 — Input tape.
encodes the input word; position encoding allows reading symbol at step via an MPAL head with tuned to isolate position (cf. Pérez et al. (2021), Lemma 4).
Simulation step .
Correctness.
By induction: at step , stacks and faithfully encode the 2-PDA configuration . The base case is the initial configuration . The inductive step follows from Lemmas 4.2 and 4.3.
Since a 2-PDA is equivalent to a Turing machine (Hopcroft and Ullman, 1979), is Turing-complete.
Depth.
The construction uses MPAL layer and MLP layer — total depth , independent of .
Precision.
Encoding depth and alphabet size requires distinguishing values spaced apart up to , needing bits. ∎
Corollary 4.7 (Depth separation from Transformer).
There exists a language recognisable by a 1-layer PAL-Transformer that requires layers for any transformer with heads (under the circuit complexity lower bounds of Furst et al. (1984)).
Corollary 4.8 (Vector PAL is TC with a single head).
A single-layer vector PAL-Transformer (vPAL) with projection , a single head (), one MLP layer, and arithmetic precision bits is Turing-complete.
Proof.
The two-dimensional signal admits two independent extremum stacks:
| (15) |
We assign to encode stack 1 of the 2-PDA and to encode stack 2, both using the Cantor-depth encoding of Definition 4.1. The machine state and input tape are encoded in the angular structure of the vPAL measure : distinct sectors correspond to distinct states , following the superposition of Preisach half-planes introduced by Frydrych (2019, Chapter 4.2.2) for two-axis fluxgate sensors. The MLP reads and simultaneously from the single vPAL head and computes the transition exactly as in Theorem 4.6. Correctness and depth follow identically. Since head carries both stacks via the two signal dimensions, the head count is reduced from to at the cost of replacing the scalar projection with the vector projection . ∎
Remark 4.9 (Differentiability and practical training).
The binary relay is discontinuous in both the input and the thresholds , making exact backpropagation undefined at switching points. Practical implementations require a smooth relaxation, e.g. the stateful sigmoid relaxation with temperature during training (straight-through estimator or curriculum annealing). The theoretical results of this paper hold for the exact binary relay; the relaxed version is studied empirically in companion work.
Remark 4.10 (Position encoding and rate-independence).
Definition 3.3 includes sinusoidal position encoding , which may appear to contradict rate-independence. The tension is deliberate: is used by the MLP layers (to implement positional logic, e.g. reading the input tape in the TC proof) but is not passed through the MPAL heads. The MPAL output itself, the integral over active relays, remains rate-independent. Rate-independence therefore applies to the PAL component of the architecture, not to the full PAL-Transformer. This is consistent with Theorem 5.7, which characterises the function class of the PAL heads, and with the expressiveness separation, which constructs functions PAL heads cannot compute.
Remark 4.11 (Heads vs. signal dimensions as exchangeable resources).
Corollary 4.8 formalises the intuition that in PAL, the number of heads and the signal dimension are exchangeable: scalar heads can be replaced by a single head with -dimensional projection. This exchangeability does not hold for standard multi-head attention, where each head learns an independent linear projection without the hysteretic coupling that links dimensions in vPAL. Specifically, scalar PAL heads require parameters for the measure; a single -dimensional vPAL head requires parameters (the -fold Cartesian product of Preisach half-planes), which grows exponentially. The minimal TC architecture is therefore scalar PAL at (two stacks, two heads) or vPAL at (two stacks, one head with ).
5 Expressiveness Separation
We prove the central incomparability result.
Theorem 5.1 (Expressiveness incomparability).
Let and be the function classes computable by bounded-depth PAL-Transformer and Transformer respectively. Then:
| (16) |
5.1 Functions PAL computes but Transformer cannot (at bounded depth)
Proposition 5.2 (Historical range in layers).
The function is computable by a 1-layer PAL in time.
Proof.
and are directly readable from the extremum stack as the first maximum and last minimum. . ∎
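As a reference for the statistic itself, the historical range can be maintained with two running scalars at constant cost per token. The sketch below computes the target function directly, so it can serve as a ground truth when testing a layer; it is not a model of the PAL mechanism, and the name `running_range` is chosen for illustration.

```python
def running_range(xs):
    """Return max(x_1..x_t) - min(x_1..x_t) for every prefix of xs."""
    hi = lo = None
    out = []
    for x in xs:
        hi = x if hi is None else max(hi, x)   # running maximum
        lo = x if lo is None else min(lo, x)   # running minimum
        out.append(hi - lo)
    return out


ranges = running_range([2, 5, 3, 1, 4])
```

Each step performs two comparisons regardless of how long the history is.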
Proposition 5.3 (Transformer requires layers for range).
Any transformer with layers and heads cannot compute exactly on sequences of length .
5.2 Functions Transformer computes but PAL cannot
Proposition 5.4 (Random-access retrieval).
The function (retrieve token at position ) is computable by a 1-layer hard-attention Transformer in layers.
Proof.
Set and . Hard attention selects . The output is . ∎
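The retrieval construction can be sketched with one-hot positional codes. The choice of one-hot keys and a one-hot query is an assumption of this example (any positional code whose inner product is uniquely maximised at the target position would do), and the names `one_hot` and `retrieve` are hypothetical.

```python
def one_hot(j, n):
    """Positional code of position j in a sequence of length n."""
    return [1.0 if c == j else 0.0 for c in range(n)]


def retrieve(xs, i):
    """Return xs[i] through a single hard-attention lookup."""
    n = len(xs)
    query = one_hot(i, n)                       # asks for position i
    keys = [one_hot(j, n) for j in range(n)]    # key j encodes position j
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    best = scores.index(max(scores))            # argmax over positions
    return xs[best]                             # value at that position


token = retrieve([7, 3, 9, 4], 2)
```

The score is 1 at the queried position and 0 elsewhere, so the argmax selects exactly one position whatever the token values are.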
Proposition 5.5 (PAL cannot perform exact random access).
No PAL-Transformer (of any depth or head count) can compute for all sequences and positions.
Proof.
We give a structural impossibility proof based on rate-independence (Proposition 2.3 and Theorem 5.7).
Construction.
Fix any target position and any value . Define two sequences that differ only at position :
| (17) | ||||
| (18) |
where the surrounding values and satisfy (i.e. is in a strictly monotone ascending segment of both sequences).
Extremum stacks are identical.
Since position lies in a monotone ascending segment of both sequences, it is not a local extremum of either or . Therefore : the extremum stacks are identical, because only positions where is a local extremum enter the stack (Algorithm 1), and those positions are the same in both sequences.
Rate-independence forces identical outputs.
By Theorem 5.7, every PAL-Transformer computes a rate-independent function, hence a function of alone (Proposition 2.7). Since , any PAL-Transformer outputs the same value on and . But . Therefore no PAL-Transformer computes . ∎
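The construction can be checked on a small example. The sketch below uses an assumed flat stack layout (maxima at even indices, minima at odd indices, current value last) and a hypothetical helper `final_stack`; the two toy sequences differ only at index 2, which lies on an ascending run in both.

```python
def update_stack(stack, x):
    """Extremum-stack update (assumed layout: max, min, max, ..., current)."""
    top_is_max = len(stack) % 2 == 1
    if stack and ((top_is_max and x >= stack[-1]) or
                  (not top_is_max and x <= stack[-1])):
        stack.pop()                       # same direction: replace the top
    while len(stack) >= 2:
        x_is_max = len(stack) % 2 == 0
        dominated = x >= stack[-2] if x_is_max else x <= stack[-2]
        if not dominated:
            break
        stack.pop()                       # wipe the dominated pair
        stack.pop()
    stack.append(x)
    return stack


def final_stack(xs):
    stack = []
    for x in xs:
        update_stack(stack, x)
    return stack


a = [0, 1, 2, 5, 3]
b = [0, 1, 4, 5, 3]      # differs from a only at index 2 (1 < 4 < 5)

stack_a = final_stack(a)
stack_b = final_stack(b)
```

After the whole sequence has been processed both stacks contain only the maximum 5 and the current value 3, so nothing computed from the final stack alone can recover the differing token.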
Remark 5.6 (Why the probabilistic argument is incorrect).
An earlier version of this proof argued that position is a local extremum with probability for a random sequence, making inaccessible with high probability. This is incorrect: for i.i.d. sequences from a continuous distribution, an interior position is a local extremum (maximum or minimum) with probability $2/3$, by the symmetry of the six orderings of three consecutive values, four of which place the middle value at an extremum. The correct argument is the structural one above, which applies universally and does not rely on any probabilistic assumption about the input.
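The counting argument can be verified by enumerating the six equally likely orderings of three distinct consecutive values. This is a small self-check; the function name `middle_extremum_probability` is chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations


def middle_extremum_probability():
    """Exact probability that the middle of three i.i.d. continuous
    values is a strict local maximum or minimum."""
    orderings = list(permutations(range(3)))   # six equally likely rank orders
    hits = sum(1 for left, mid, right in orderings
               if (mid > left and mid > right) or (mid < left and mid < right))
    return Fraction(hits, len(orderings))


probability = middle_extremum_probability()
```

Ties have probability zero under a continuous distribution, so only the rank order of the three values matters.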
5.3 The separating property: rate-independence
Theorem 5.7 (Rate-independence as separating property).
A function is computable by a PAL-Transformer only if it is rate-independent in the sense of Proposition 2.3. Standard attention computes functions that are not rate-independent. Hence rate-independence is a necessary condition for membership in and a sufficient condition for .
Proof.
PAL is a composition of rate-independent operators (relay + linear combination) and rate-independent nonlinearities (ReLU is rate-independent). By closure under composition, any function computable by a PAL-Transformer is rate-independent.
Standard attention is not rate-independent: the attention weight depends explicitly on position through , so inserting a time-rescaling changes the output. ∎
6 Logical Characterisation
Barceló et al. (2020) showed that soft-attention transformers correspond to with aggregate functions — a fragment of first-order logic over sequences. We give an analogous characterisation for PAL.
Definition 6.1 (Extremum First-Order Logic, ).
is the extension of first-order logic over sequences with:
1. Quantification over extremal positions: asserts that there exists a local extremum position satisfying .
2. An extremum aggregate operator: sums over all extremal positions weighted by .
3. No quantification over arbitrary positions.
Theorem 6.2 (PAL corresponds to EFO).
A function is computable by a bounded-depth PAL-Transformer if and only if it is definable in .
Proof sketch.
($\Rightarrow$) Each MPAL head computes , an extremum aggregate. Composition through MLP layers corresponds to Boolean combinations of formulae.
($\Leftarrow$) Each formula can be implemented by appropriate choice of and in MPAL. The extremal quantifier is implemented by the relay’s dead band: only extremal positions cause relay state changes, so only extremal positions contribute to the sum. The full proof is by structural induction on formula complexity; details are in Appendix A. ∎
Remark 6.3 (Comparison with FO + Aggregate).
: cannot quantify over arbitrary positions (no random access), while can. Conversely, can express directly as , which has no bounded-depth definition.
7 Computational Complexity
Theorem 7.1 (Complexity of PAL inference).
For a single-layer -head PAL-Transformer processing a sequence of length :
1. Time per token: where is extremum stack depth, model dimension.
2. Total time for sequence of length : .
3. Memory: .
Compare with standard attention: time , memory .
| Architecture | Time (total) | Memory | Depth for TC | Heads for TC |
|---|---|---|---|---|
| Transformer (softmax) | | | not TC | n/a |
| Transformer (hard) | | | | |
| Mamba / SSM | | | open | n/a |
| RWKV | | | open | n/a |
| PAL (scalar, ) | | | | |
| vPAL (vector, ) | | | | |
Predicted task advantages.
By Theorems 5.1 and 5.7, PAL is predicted to outperform standard attention on tasks that are rate-independent and require long episodic memory:
1. Tracking entity states across long documents (who did what to whom).
2. Detecting anomalies in time series (historical range, running extrema).
3. Reasoning over ordered events without positional sensitivity (logical puzzles, symbolic reasoning).
4. Energy market dispatch where decision thresholds rather than price histories determine behaviour (Frydrych and Szewczyk, 2014).
8 Related Work
Expressiveness of transformers.
Pérez et al. (2021) proved Turing completeness of hard-attention transformers at depth . Hahn (2020) and Barceló et al. (2020) characterised soft-attention transformers as with aggregate functions. Merrill and Sabharwal (2023) established circuit complexity bounds separating transformer classes. PAL is a new point in this space, incomparable to existing architectures.
Alternative memory mechanisms.
Mamba (Gu and Dao, 2023) uses selective SSMs with linear recurrence and input-dependent forgetting. RWKV (Peng et al., 2023) replaces attention with time-decay weighted recurrence. Titans (Behrouz et al., 2025) learns long-term memory through online gradient descent. All use temporal (recency-based) forgetting; PAL uses significance-based forgetting through wiping.
Hysteresis in machine learning.
Frydrych and Szewczyk (2014); Frydrych (2019) applied Preisach-type models to financial market modelling. Barroso et al. (2015) used Preisach operators for battery electrochemical modelling. To our knowledge, this is the first work to use the Preisach operator as a sequence modelling layer in neural networks.
Sparse and efficient attention.
9 Connection to the Random-Field Ising Model
9.1 The Preisach–RFIM Equivalence
The connection between the Preisach operator and the Random-Field Ising Model (RFIM) has been established in the physics literature (Sethna et al., 2006; Dahmen and Sethna, 1993), and provides a bridge between PAL and a broad class of combinatorial and statistical problems.
The mean-field RFIM at consists of Ising spins with Hamiltonian:
| (19) |
where , is a uniform external field, and are independent random fields drawn from distribution .
At , each spin satisfies the single-spin stability condition: , where is the mean magnetisation. Spin flips from to when crosses the activation threshold from below, and flips back when crosses the deactivation threshold from above.
Proposition 9.1 (Preisach–RFIM equivalence).
The mean-field RFIM at , driven quasi-statically by external field , is equivalent to a Preisach operator with measure:
| (20) |
where the measure is supported on the line within the Preisach half-plane . The magnetisation satisfies the self-consistency equation .
Proof.
Each spin contributes a relay with (the coupling gap). The output and the aggregate magnetisation:
where is the empirical measure of threshold pairs, converging to (20) as by the law of large numbers applied to i.i.d. . ∎
Remark 9.2 (Structural consequences).
Proposition 9.1 implies that the hysteresis loop, subloop structure, and wiping property of the RFIM are exact consequences of the Preisach operator structure. In particular, the return-point memory of the RFIM (Sethna et al., 2006) is precisely the wiping property (Proposition 2.4), and the extremum stack (Definition 2.6) is the minimal sufficient statistic of the RFIM’s history under quasi-static driving.
9.2 PAL as a Learned, Sequential RFIM
Under the Preisach–RFIM equivalence, PAL is a learned, sequential, non-equilibrium generalisation of the RFIM:
| Dimension | RFIM | PAL |
|---|---|---|
| Spins | fixed binary | relay states , learned |
| Driving signal | quasi-static field | arbitrary sequence |
| Disorder | quenched | learned measure |
| Coupling | mean-field | implicit through self-consistency of |
| Dynamics | equilibrium (lowest energy) | causal, rate-independent |
| Memory | return-point memory (history of ) | extremum stack |
| Wiping | return-point memory property | Proposition 2.4 |
The key conceptual shift: the RFIM asks "what is the equilibrium configuration at field ?" while PAL asks "what is the causal output of the sequence ?" This shift from equilibrium to sequential opens PAL to a broad class of problems where Ising-type models apply but the data arrives as a stream.
9.3 Problems where PAL Inherits Ising Expressiveness
We identify three problem classes where the RFIM–Preisach equivalence suggests PAL has structural advantages.
9.3.1 Sequential Binary Optimisation
The Ising model underlies a broad class of NP-hard combinatorial problems: MAX-CUT, graph colouring, satisfiability, and the Hopfield associative memory (Lucas, 2014). Their energy function has the form:
| (21) |
and the goal is .
Proposition 9.3 (PAL tracks mean-field Ising energy for streaming interactions).
Let interactions arrive sequentially as a stream, drawn i.i.d. from a distribution with mean and variance . Define the scalar signal . Then the PAL output with measure satisfies:
| (22) |
where is the mean-field Ising energy for the empirical mean magnetisation , and the error follows from a standard concentration bound on the empirical mean.
Proof.
Each interaction contributes to the Ising energy when its relay , i.e. when — which holds trivially at the moment of arrival. The relay then tracks whether remains the dominant interaction. By the wiping property, relay remains active at time iff no subsequent has arrived. The PAL aggregate sums over all non-dominated interactions, giving a running estimate of the effective coupling.
For the mean-field case for all pairs, the self-consistency equation from Proposition 9.1 holds exactly: by the law of large numbers applied to over observations, recovering the mean-field energy . The error bound follows from Hoeffding’s inequality applied to the sum of bounded relay outputs. ∎
Remark 9.4 (Relation to MAX-CUT).
The mean-field Ising energy in Proposition 9.3 is related to the MAX-CUT value by for the optimal cut . However, PAL does not solve MAX-CUT; it tracks the running mean-field energy, which provides a lower bound on the cut value. The advantage is computational: PAL updates this estimate in per new interaction, without re-solving from scratch.
9.3.2 Associative Memory with Structured Forgetting
The Hopfield network (Hopfield, 1982) stores patterns as attractors of Ising dynamics with . Retrieval capacity is limited to patterns before catastrophic interference (Amit et al., 1985).
PAL offers a different memory model where forgetting is determined by significance (extremality) rather than capacity:
Proposition 9.5 (PAL as hysteretic associative memory with capacity bound).
Let patterns arrive sequentially with activation strengths . Let be the depth of the extremum stack after all patterns have been processed.
1. Storage: The extremum stack retains exactly those patterns whose activation strength constitutes a local extremum of the stream . All other patterns are wiped by the wiping property (Proposition 2.4).
2. Non-overlapping threshold support: The retained patterns correspond to disjoint threshold pairs with for (strict monotonicity of the stack, Lemma 4.2), so their relays have disjoint support on and their contributions to are orthogonal in measure space. Note that wiping is a form of interference between patterns: a stronger subsequent pattern erases a prior one. The claim is that the surviving patterns do not mutually interfere in the PAL output.
3. Capacity: The maximum number of simultaneously retrievable patterns is exactly , bounded by the number of local extrema in :
| (23) |
where is the number of local extrema in the activation strength sequence. This differs fundamentally from Hopfield capacity (Amit et al., 1985): PAL capacity is limited by stack depth, not weight-matrix rank.
Proof.
Storage (part 1): Direct from Proposition 2.4. Pattern is wiped when a subsequent pattern with arrives, since the new maximum overwrites the old in the extremum stack. Only patterns at local extrema of survive.
Zero interference (part 2): The PAL output is . Each retained pattern corresponds to a unique pair in with for (strict monotonicity of the stack, Lemma 4.2). The relays and have disjoint support on the Preisach half-plane when , so their contributions to are orthogonal: there is no cross-pattern interference.
Capacity (part 3): The stack depth is bounded by the number of local extrema of the activation sequence, since each extremum corresponds to at most one stack entry by the push/pop mechanics of Algorithm 1. Each entry encodes one pattern without interference. The Hopfield bound follows from the rank of the Hebbian weight matrix , which is limited by the number of stored patterns relative to . PAL has no weight matrix; its capacity is limited by stack depth, not by the dimension of the weight space. ∎
Remark 9.6 (Retrieval mechanism).
To retrieve pattern from the PAL stack, a query signal is presented. The relay activates uniquely for the pattern at stack position , and the associated value is returned. Retrieval time is by binary search on the ordered stack.
9.3.3 Sequential Belief Propagation in Markov Random Fields
Markov Random Fields (MRF) with pairwise binary interactions are equivalent to Ising models with site-dependent external fields. Standard belief propagation (BP) on MRFs is a static algorithm requiring the full graph to be available upfront.
Proposition 9.7 (PAL as exact causal BP on tree-structured MRF).
Let be a tree-structured MRF with binary variables and pairwise potentials . Observations arrive sequentially in leaf-to-root order. Define the signal where is the marginal belief at step under belief propagation. Then the PAL operator with measure:
| (24) |
satisfies exactly for all tree-structured MRFs, where is the exact marginal belief at the root. For loopy graphs, the PAL output approximates the loopy BP fixed point.
Proof.
On a tree, belief propagation computes exact marginals by the Bethe-Peierls equations (Mézard and Montanari, 2009). In leaf-to-root order, each message depends only on previously received messages, making the computation causal. The effective field at the root satisfies , which is the fixed point of the Preisach self-consistency equation from Proposition 9.1. With measure from (24), tracks causally, and exactly on trees by induction on tree depth. ∎
9.4 Avalanches and Phase Transitions in PAL
A striking property of the RFIM is its disorder-induced phase transition: at a critical disorder strength , the hysteresis loop develops a discontinuity corresponding to a macroscopic avalanche of spin flips (Sethna et al., 2006). At criticality, avalanche sizes follow a power law with universal exponent .
Proposition 9.8 (Scalar PAL criticality).
A scalar PAL with measure drawn from a Gaussian distribution with variance over the two-dimensional Preisach half-plane exhibits a phase transition at critical variance :
- For : the output varies smoothly with , the subcritical regime corresponding to gradual relay activations.
- For : a macroscopic fraction of relays activate simultaneously at a critical threshold, the supercritical regime corresponding to an infinite avalanche.
- At : relay activations follow a power law in group size, with the same universality class as the mean-field RFIM.
Proof.
By Proposition 9.1, PAL with Gaussian measure is equivalent to mean-field RFIM with disorder . The phase transition of the RFIM at (Dahmen and Sethna, 1993) translates directly to PAL. At , the self-consistency equation has a bifurcation point, corresponding to a jump discontinuity in the Preisach output. ∎
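The bifurcation can be seen in a small numerical sketch of the zero-temperature mean-field RFIM with Gaussian random fields of standard deviation `R` and coupling `J`. Its self-consistency equation is m = erf((J m + H) / (sqrt(2) R)), and the slope condition at the origin gives the critical disorder R_c = sqrt(2/pi) J. How `R` maps onto the variance parameter of the proposition is an assumption here, since that mapping depends on Proposition 9.1; the function names and sweep parameters are illustrative.

```python
import math

R_C = math.sqrt(2.0 / math.pi)  # critical disorder for coupling J = 1


def sweep_up(R, J=1.0, h_min=-3.0, h_max=3.0, steps=600):
    """Follow the metastable magnetisation branch while the field H increases."""
    m = -1.0
    ms = []
    for k in range(steps + 1):
        H = h_min + (h_max - h_min) * k / steps
        # Fixed-point iteration, started from the previous solution so that
        # the sweep stays on the current branch until it disappears.
        for _ in range(5000):
            m_new = math.erf((J * m + H) / (math.sqrt(2.0) * R))
            converged = abs(m_new - m) < 1e-12
            m = m_new
            if converged:
                break
        ms.append(m)
    return ms


def largest_jump(ms):
    """Largest change of the magnetisation between consecutive field values."""
    return max(abs(b - a) for a, b in zip(ms, ms[1:]))
```

With `R = 0.5 * R_C` the sweep shows one macroscopic jump (the infinite avalanche), while with `R = 2.0 * R_C` the magnetisation changes by roughly the field increment at every step.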
Remark 9.9 (Implications for PAL learning).
Proposition 9.8 has a practical implication: if the learned measure concentrates near the critical disorder , the PAL layer operates near a phase transition, maximising sensitivity to input changes while maintaining structured memory. This is analogous to the edge of chaos hypothesis in recurrent neural networks (Langton, 1990), but with a precise physical characterisation through RFIM criticality. Training PAL near may be a principled alternative to spectral radius regularisation of recurrent weights.
9.5 Summary: PAL vs. Ising-based Methods
| Criterion | Ising / RFIM | PAL | Advantage |
|---|---|---|---|
| Data model | Static graph, couplings | Sequential stream | PAL |
| Update cost | per new node | per new token | PAL |
| Forgetting | None (quenched disorder) | Significance-based wiping | PAL |
| Exact optimisation | NP-hard in general | Approximation only | Ising |
| Equilibrium guarantees | Yes (Gibbs measure) | No (causal only) | Ising |
| Universality class | Known () | Inherited from RFIM | Equal |
| Coupling structure | Arbitrary | Mean-field implicit | Ising |
| Sequence modelling | Not native | Native (rate-independent) | PAL |
10 Conclusion
We introduced the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in classical hysteresis theory. The results establish:
1. PAL-Transformer is Turing-complete at depth , improving on the depth of standard transformers (Theorem 4.6). Moreover, a vector PAL (vPAL) with two-dimensional signal projection achieves Turing completeness with a single head (), establishing that signal dimension and head count are exchangeable resources in PAL (Corollaries 4.8 and 4.11).
2. The function classes of PAL and transformer are incomparable, with rate-independence as the separating property (Theorems 5.1 and 5.7).
3. PAL corresponds exactly to Extremum First-Order Logic (EFO), a strict fragment of the FO + Aggregate class corresponding to transformers (Theorem 6.2).
4. PAL is the learned, sequential, causal generalisation of the mean-field Random-Field Ising Model at , inheriting its universality class and phase-transition structure (Propositions 9.1 and 9.8).
PAL is thus a natural architecture for tasks with long episodic memory, where the significance of past events matters more than their recency or position, and where the problem structure is naturally binary and threshold-driven.
Open questions.
1. Empirical validation: Do the predicted task advantages materialise in practice? Experiments on state-tracking benchmarks (MQAR, SCROLLS) would test Theorem 5.1 empirically.
2. Learning dynamics and criticality: Does gradient-based learning of drive the measure toward the critical disorder ? Is training near criticality beneficial empirically, analogous to the edge-of-chaos effect?
3. Beyond mean-field: Can the equivalence in Proposition 9.1 be extended beyond mean-field RFIM to short-range interactions (Bethe lattice, finite-dimensional RFIM)? This would connect PAL to a richer universality class.
4. Hybrid architectures: Can PAL heads be combined with standard attention heads to obtain both random access and significance-based memory?
5. Continuous-time extension: The rate-independence of PAL suggests a natural extension to continuous-time sequence models (neural ODEs, S4), where the driving signal is a continuous path rather than a discrete sequence.
6. Vector PAL and multi-dimensional inputs: Corollary 4.8 establishes that vPAL with a two-dimensional signal is Turing-complete at head, reducing the head count from (scalar PAL) to by exploiting the two independent extremum stacks of the vector signal. This follows the superposition of Preisach half-planes developed by Frydrych (2019) for two-axis fluxgate sensors (Chapters 4.2.2–4.2.6), where the full magnetisation vector is integrated over the three-dimensional space with displacement and rotation components. Three questions remain open for vPAL: (a) Does the vector RFIM connection extend to the anisotropic case studied by Frydrych, where the measure breaks rotational symmetry? (b) Can the domain-rotation component be interpreted as a differentiable residual that complements the binary relay, analogous to soft attention complementing hard attention? (c) Does the exponential growth of the measure parameter space ( for -dimensional vPAL) create a fundamental expressiveness–efficiency tradeoff absent in scalar PAL?
Broader Impact Statement
This work introduces a theoretical architecture for sequence modelling grounded in classical hysteresis theory. The primary contributions are mathematical — Turing completeness proofs, expressiveness separations, and logical characterisations — and do not directly enable any specific application.
The connection to the Random-Field Ising Model (Section 9) suggests potential applications in combinatorial optimisation, associative memory, and belief propagation. These are established areas of machine learning with broad beneficial applications. We are not aware of direct pathways from this theoretical work to harmful applications.
If implemented in practice, PAL-based models would share the general risks of machine learning systems: potential for bias amplification, misuse in surveillance or manipulation, and environmental cost of training large models. These risks are not specific to PAL and are addressed by general ML ethics guidelines.
The computational complexity of PAL (versus for attention) may reduce the energy cost of training long-context models, which is a potential positive environmental impact.
References
- What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations.
- Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters 55 (14), pp. 1530–1533.
- The logical expressiveness of graph neural networks. In International Conference on Learning Representations.
- Preisach modeling of LiFePO4 lithium-iron-phosphate battery hysteresis. Journal of Energy Storage 2, pp. 65–72.
- Titans: learning to memorize at test time. arXiv preprint arXiv:2501.00663.
- Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- Hysteresis and phase transitions. Applied Mathematical Sciences, Vol. 121, Springer, New York.
- Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314.
- Hysteresis loop critical exponents in $6-\epsilon$ dimensions. Physical Review Letters 71 (20), pp. 3222–3225.
- New portfolio risk optimisation method for strongly dependent assets. Journal of Engineering Studies and Research 20 (3), pp. 30–37.
- Modelowanie charakterystyk magnesowania amorficznych rdzeni dwuosiowych sensorów transduktorowych. Ph.D. Thesis, Politechnika Warszawska, Wydział Mechatroniki, Warsaw, Poland.
- Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory 17, pp. 13–27.
- Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics 8, pp. 156–171.
- Computational limitations of small-depth circuits. MIT Press, Cambridge, MA.
- Introduction to automata theory, languages, and computation. Addison-Wesley, Reading, MA.
- Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (8), pp. 2554–2558.
- Computation at the edge of chaos: phase transitions and emergent computation. Physica D: Nonlinear Phenomena 42 (1–3), pp. 12–37.
- Ising formulations of many NP problems. Frontiers in Physics 2, pp. 5.
- Mathematical models of hysteresis. Springer, New York.
- The parallelism tradeoff: limitations of log-precision transformers. Transactions of the Association for Computational Linguistics 11, pp. 531–545.
- Information, physics, and computation. Oxford University Press, Oxford.
- RWKV: reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077.
- Attention is Turing complete. Journal of Machine Learning Research 22 (75), pp. 1–35.
- Über die magnetische Nachwirkung. Zeitschrift für Physik 94 (5–6), pp. 277–302.
- Random-field Ising models of hysteresis. The Science of Hysteresis 2, pp. 107–179.
- Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations.
- On the computational power of neural nets. Journal of Computer and System Sciences 50 (1), pp. 132–150.
- Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
- Transformers learn in-context by gradient descent. In International Conference on Machine Learning, Vol. 202, pp. 35151–35174.
- Big Bird: transformers for longer sequences. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17283–17297.
Appendix A Full Proof of Theorem 6.2: PAL corresponds to EFO
We prove Theorem 6.2 by structural induction on EFO formula complexity, establishing a constructive correspondence between EFO formulae and PAL-Transformer computations.
A.1 Formal Setup
We work over sequences where each . Let denote the set of extremal positions of .
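As a concrete reading of the set of extremal positions, the sketch below returns the indices of strict interior local maxima and minima of a scalar sequence. The strictness convention, the exclusion of endpoints, and the name `extremal_positions` are assumptions; the paper's own definition may treat plateaus and boundary positions differently.

```python
def extremal_positions(x):
    """Indices i with x[i-1] < x[i] > x[i+1] or x[i-1] > x[i] < x[i+1]."""
    ext = []
    for i in range(1, len(x) - 1):
        is_max = x[i - 1] < x[i] > x[i + 1]
        is_min = x[i - 1] > x[i] < x[i + 1]
        if is_max or is_min:
            ext.append(i)
    return ext
```

For vector-valued tokens this would be applied to a scalar projection of the sequence, as in Lemma A.3.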
Definition A.1 (EFO syntax).
The grammar of Extremum First-Order Logic (EFO) over sequences is:
where is a constant, ranges over extremal positions only, denotes that extremal position occurred before extremal position in the sequence, and denotes:
| (25) |
where is a measurable function. The ordering relation is needed to express the causal order of relay state changes, since the relay state at time depends on which of the activation and deactivation thresholds was crossed most recently.
Definition A.2 (EFO semantics).
A sequence satisfies (written ) according to the standard first-order semantics restricted to extremal positions:
- iff for the current assignment of .
- iff there exists such that .
- Boolean connectives have their standard meaning.
- sums over extremal positions satisfying .
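The semantic clauses above can be sketched as a tiny evaluator in which quantification and aggregation range over extremal positions only. Formulae are represented as Python predicates `phi(x, i)`, which is an encoding chosen for illustration; the helper names and the strict interior definition of an extremum are likewise assumptions.

```python
def local_extrema(x):
    """Strict interior local maxima and minima of a scalar sequence."""
    return [
        i
        for i in range(1, len(x) - 1)
        if (x[i - 1] < x[i] > x[i + 1]) or (x[i - 1] > x[i] < x[i + 1])
    ]


def exists_ext(x, phi):
    """Existential quantifier restricted to extremal positions."""
    return any(phi(x, i) for i in local_extrema(x))


def agg_ext(x, phi, f):
    """Aggregate: sum of f(x[i]) over extremal positions i where phi holds."""
    return sum(f(x[i]) for i in local_extrema(x) if phi(x, i))
```

For `x = [0, 2, 1, 3, 0]` the extremal values are 2, 1 and 3, so aggregating the identity over positions with value at least 2 gives 5, and no extremal position has value at least 4.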
A.2 The Correspondence
We establish a bijection between EFO formulae and PAL computations by defining a compilation map .
Lemma A.3 (Extremal position indicator).
The indicator is computable by a single-head PAL as:
| (26) |
where is the scalar projection of the sequence (one specific scalarisation; any injective suffices for the lemma to hold). That is, the relay changes state at step if and only if is a local extremum.
Proof.
Lemma A.4 (Threshold comparison).
The formula at an extremal position is computable by a PAL with thresholds , as:
| (27) |
Proof.
Direct from Definition 2.1: when the input crosses threshold , the relay switches to 1; when it falls below , it switches to 0. In the limit , this detects at each extremal step. ∎
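The relay behaviour described in the proof can be written as a short sketch. The function name `relay`, the initial state 0, and the use of non-strict comparisons at the thresholds are assumptions; Definition 2.1 may fix these conventions differently.

```python
def relay(xs, alpha, beta, state=0):
    """Binary relay with activation threshold alpha and deactivation threshold beta.

    Assumes beta <= alpha. The state switches to 1 when the input reaches alpha,
    to 0 when it falls to beta or below, and is held otherwise.
    """
    out = []
    for x in xs:
        if x >= alpha:
            state = 1
        elif x <= beta:
            state = 0
        out.append(state)
    return out
```

With a wide gap between the thresholds the output depends on history: `relay([0, 3, 1.5, 0.5, 1.5, 2.5], alpha=2, beta=1)` holds its state at both occurrences of 1.5. As `beta` approaches `alpha` the held region vanishes and the output reduces to the pointwise comparison of the input with the threshold.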
Lemma A.5 (Boolean combinations via MLP).
Let be two EFO formulae computable by PAL heads producing outputs . Then , , and are computable by a two-layer MLP applied to .
Proof.
Boolean functions on are representable by two-layer networks with ReLU activations (Cybenko, 1989):
These are exact (not approximate) representations on . ∎
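One standard choice of exact ReLU representations on binary inputs is sketched below. The displayed formulas of the lemma are not legible in this copy, so these particular expressions are an assumption and may differ from the paper's; they are exact only for inputs in {0, 1}.

```python
def relu(z):
    return max(0, z)


def and_(a, b):
    # Fires only when both inputs are 1.
    return relu(a + b - 1)


def or_(a, b):
    # 1 minus the indicator that both inputs are 0.
    return 1 - relu(1 - a - b)


def not_(a):
    return relu(1 - a)
```

Each function is a single hidden ReLU unit followed by an affine output, so it fits within a two-layer MLP applied to the pair of head outputs.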
Lemma A.6 (Extremum aggregation).
The formula is computable by a single MPAL head with measure:
| (28) |
where recovers the value encoded at thresholds .
Proof.
The MPAL output for a single head with measure is:
At time , if and only if the current extremum stack contains a pair with and , i.e. the relay encodes an active extremal position satisfying the threshold condition.
Setting holds at makes the PAL output equal to (25), since the relay sums over exactly those extremal positions where holds, weighted by . ∎
Lemma A.7 (Existential extremal quantification).
The formula is computable by a PAL followed by a threshold MLP:
| (29) |
Proof.
counts the number of extremal positions satisfying . This is positive iff at least one such position exists, which is exactly the semantics of . The threshold is implemented by a ReLU followed by a Heaviside (approximated to arbitrary precision by a steep sigmoid, exact under unit-cost arithmetic). ∎
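A minimal sketch of the thresholding step, applied to an integer count of matching extremal positions. The half-unit offset and the steepness `k` in the smooth version are illustrative assumptions, chosen so that counts 0 and 1 fall on opposite sides of the sigmoid's midpoint.

```python
import math


def relu(z):
    return max(0.0, z)


def heaviside(z):
    return 1 if z > 0 else 0


def exists_from_count(count):
    """Exact threshold: 1 iff at least one extremal position satisfies the formula."""
    return heaviside(relu(count))


def soft_exists(count, k=50.0):
    """Steep-sigmoid approximation of the same threshold on integer counts."""
    return 1.0 / (1.0 + math.exp(-k * (count - 0.5)))
```

Increasing `k` drives the approximation error on integer counts toward zero, which is the sense in which the sigmoid approximates the Heaviside step to arbitrary precision.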
A.3 Inductive Proof of Theorem 6.2
Proof of Theorem 6.2.
We prove both directions by structural induction on EFO formula complexity.
() Every PAL-computable function is EFO-definable.
We show that every primitive PAL operation corresponds to an EFO formula.
Base case (single relay): We show is EFO-definable. By the wiping property (Proposition 2.4) and stack sufficiency (Proposition 2.7), the relay state at time is determined entirely by the most recent state-changing extremum. Specifically, iff there exists an extremal position at which the relay was activated (), and no subsequent extremal position deactivated it (). This is expressed in EFO with ordering as:
| (30) |
The existential quantifier ranges over extremal positions by Definition A.1; the comparison is atomic; the ordering uses the causal order in EFO; and is atomic. Hence (30) is an EFO formula.
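The base case can be sanity-checked numerically under explicit conventions. The sketch below compares the relay state after a full sequence with the "activated at some extremal position and not deactivated at any later one" condition. It assumes an initial relay state of 0, thresholds with beta < alpha, non-strict local extrema, and that the first and last positions count as extremal; these conventions and all function names are assumptions, and the paper's formula (30) may be stated differently.

```python
import random


def relay_state(x, alpha, beta):
    """Final relay state after feeding the whole sequence (initial state 0)."""
    state = 0
    for v in x:
        if v >= alpha:
            state = 1
        elif v <= beta:
            state = 0
    return state


def extremal(x):
    """Endpoints plus non-strict local maxima and minima."""
    n = len(x)
    return [
        i
        for i in range(n)
        if i == 0
        or i == n - 1
        or x[i - 1] <= x[i] >= x[i + 1]
        or x[i - 1] >= x[i] <= x[i + 1]
    ]


def formula(x, alpha, beta):
    """Exists an extremal i with x[i] >= alpha and no later extremal j with x[j] <= beta."""
    ext = extremal(x)
    return int(
        any(
            x[i] >= alpha and not any(j > i and x[j] <= beta for j in ext)
            for i in ext
        )
    )


def check(trials=500, seed=0, alpha=6, beta=3):
    """Compare both definitions on random integer sequences."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.randint(0, 9) for _ in range(rng.randint(1, 12))]
        if relay_state(x, alpha, beta) != formula(x, alpha, beta):
            return False
    return True
```

Agreement on random inputs is only a consistency check of the stated conventions, not a substitute for the inductive argument.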
Inductive case (weighted sum): is a weighted sum of relay states. By the inductive hypothesis each is EFO-definable; their weighted sum is , which is EFO by Definition A.1.
MLP layers: MLP computes Boolean combinations (Lemma A.5) and threshold functions of PAL outputs, all of which are EFO-definable by induction.
Multi-head composition: The MPAL output is a sum of single-head PAL outputs; EFO is closed under addition (as a special case of ).
By induction, the entire PAL-Transformer output is EFO-definable.
() Every EFO formula is PAL-computable.
We show that each EFO construct is implementable by a PAL component.
Atomic formula : Implemented by Lemma A.4 with , .
Boolean combinations: Implemented by Lemma A.5.
Existential quantification : Implemented by Lemma A.7.
Extremum aggregation : Implemented by Lemma A.6 with measure holds at .
Nested formulae: Suppose where is an EFO formula. By the inductive hypothesis, is implementable by a PAL sub-computation producing output for each extremal position . Then is implemented by a PAL head with measure , where is the output of the sub-computation. This composes PAL layers — at most one additional layer per level of nesting.
By induction on formula depth, every EFO formula is implementable by a PAL-Transformer with depth proportional to the nesting depth of operators. ∎
A.4 Corollaries of the EFO Correspondence
Corollary A.8 (Decidability of PAL expressiveness).
Given a function , deciding whether reduces to deciding whether is EFO-definable, which is decidable for functions over finite alphabets by standard model-theoretic methods.
Corollary A.9 (EFO ⊊ FO + Aggregate).
EFO is strictly contained in FO + Aggregate (the logic corresponding to soft-attention transformers (Barceló et al., 2020)):
The strict inclusion is witnessed by : definable in with positional quantification , but not in (which cannot quantify over non-extremal positions). The reverse inclusion fails because is definable in but not in bounded-depth (Proposition 5.3).
Corollary A.10 (PAL depth vs. formula nesting).
A -layer PAL-Transformer implements EFO formulae of nesting depth at most . Conversely, an EFO formula of nesting depth requires at most PAL layers. This gives a precise depth-expressiveness tradeoff: depth PAL EFO-depth .
Appendix B Auxiliary Lemmas for Section 4
B.1 Correctness of Stack Encoding under Repeated Symbols
Here we verify that the Cantor-depth encoding (Definition 4.1) handles stacks with repeated symbols correctly across all three operations.
Lemma B.1 (POP removes exactly one element).
Under the Cantor-depth encoding, the POP signal removes exactly the topmost pair and no other.
Proof.
The wiping property removes all pairs with . Since (strict monotonicity), we have , so the pair is wiped. Since implies is not less than , the pair is preserved. (More precisely, wiping removes ; is vacuously false since we check strict inequality.) Hence exactly one pair is removed. ∎
Lemma B.2 (PUSH does not trigger wiping).
Under the Cantor-depth encoding, the PUSH signal does not trigger the wiping property.
Proof.
By Lemma 4.2, for all . Therefore (since for some and the gap is smaller than the inter-level gap , provided ). Since , the wiping condition is not satisfied. A new pair is added without removing any existing pairs. ∎
B.2 MLP Width for Transition Function
Lemma B.3 (MLP width for finite transition functions).
For a 2-PDA with states, input alphabet , and stack alphabet , the transition function is implementable by a two-layer MLP of width .
Proof.
has domain of size and is a finite lookup table. By the universal approximation theorem (Cybenko, 1989), any function on a finite domain of size is exactly representable by a two-layer network with hidden units through a table-lookup construction: for each input configuration , one hidden neuron fires for that configuration (indicator neuron) and outputs the corresponding transition values. Width: one neuron per table entry . ∎
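The table-lookup construction can be sketched as follows, assuming the configuration (state, input symbol, stack symbol) is presented to the MLP as a concatenation of one-hot vectors. That input encoding, the toy transition table in the usage note, and the names `encode` and `make_mlp` are assumptions for illustration.

```python
def encode(key, sizes):
    """Concatenate one-hot encodings of each component of the configuration."""
    v = []
    for k, n in zip(key, sizes):
        v += [1.0 if j == k else 0.0 for j in range(n)]
    return v


def make_mlp(table, sizes):
    """Two-layer ReLU network with one hidden neuron per table entry."""
    keys = list(table)
    w1 = [encode(k, sizes) for k in keys]  # hidden weights: pattern of each entry
    b1 = -(len(sizes) - 1)                 # neuron is positive only on a full match
    w2 = [table[k] for k in keys]          # output weights: stored transition value

    def forward(key):
        x = encode(key, sizes)
        h = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b1)
            for row in w1
        ]
        out_dim = len(w2[0])
        return tuple(
            sum(h[j] * w2[j][d] for j in range(len(keys)))
            for d in range(out_dim)
        )

    return forward
```

For example, with `sizes = (2, 2, 3)` and a table mapping each `(q, a, s)` to `((q + a) % 2, (s + 1) % 3)`, the network reproduces every table entry exactly. The hidden width equals the number of entries, matching the width bound in the lemma.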
Appendix C Relation to Sparse Attention
PAL can be viewed as a form of content-adaptive sparse attention where the sparsity pattern is determined by the sequence of local extrema rather than by position (sliding window) or random selection (BigBird (Zaheer et al., 2020)).
Proposition C.1 (PAL as extremal sparse attention).
PAL implements attention over the set of extremal positions:
| (31) |
where is a content-determined weight and is the relay value. The attention mask is , which is determined by the input content, not position.
Proof.
At time , the PAL output sums over relay pairs where the relay is active. A relay is active at time only if the most recent state change was an activation at some extremal position with . This is equivalent to attending to extremal positions, with weights given by the measure . ∎
This framing positions PAL as a principled alternative to heuristic sparse attention patterns: the sparsity is not imposed externally but emerges naturally from the structure of the Preisach operator.