Operational proof

Governance becomes infrastructure once AI SDLCs scale.

Three flagship demos and a benchmark suite. Start with the product loop: a Python agent proposes violating code, Mneme catches it, the agent retries with compliant output. Then see how drift compounds without governance, and how invariants hold across multiple actors. The remediation is a governance layer that lives outside any single coding tool.

By Theo Valmis · Published Apr 2026 · Restructured May 2026

Open source · Mneme dogfoods this on its own repo · 117 passing benchmark tests

01 · The problem

What review-based governance can no longer absorb

The same three pathologies show up everywhere AI assistance is taken seriously: reasonable-looking code that quietly violates architecture, reviewers who cannot keep pace with parallel agents, and drift that propagates faster than humans can detect it. The flagship demos below each isolate one of these and show what changes when a governance layer sits upstream of generation.

Drift

Reasonable code, wrong architecture

Agents pattern-match against training data, not against the decisions your team already made. The output is fluent and confidently violates an ADR no one reread this week.

Review collapse

Reviewers cannot scale linearly

One reviewer cannot evaluate the architectural implications of ten parallel agent-produced PRs per hour. The bottleneck is structural, not a matter of effort.

Compounding

Local fixes become systemic drift

Agent A introduces a divergence. Agent B builds on it. Agent C adds infrastructure around it. By the time anyone notices, the architecture has silently forked.

How Mneme works

Mneme compiles your Architectural Decision Records (ADRs) into an executable governance corpus. Every agent proposal is evaluated against this corpus before generation, ensuring architectural invariants hold across tools, sessions, and actors.
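As a rough illustration of the idea, the sketch below evaluates a proposed snippet against a tiny in-memory corpus. The Decision record, its forbidden_pattern field, the evaluate function, and the psycopg2 rule are all hypothetical stand-ins chosen for this example; they are not Mneme's actual schema or API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical compiled form of one ADR (not Mneme's real schema)."""
    decision_id: str
    summary: str
    forbidden_pattern: str  # regex whose match marks a violation


def evaluate(proposal: str, corpus: list[Decision]) -> list[Decision]:
    """Return every decision the proposed code violates, in corpus order."""
    return [d for d in corpus if re.search(d.forbidden_pattern, proposal)]


# Illustrative corpus: one rule loosely modelled on a JSON-only storage decision.
# The psycopg2 pattern is an assumption made for this sketch.
corpus = [
    Decision("ADR-001", "Storage is JSON-only", r"\bimport\s+psycopg2\b"),
]

violations = evaluate("import psycopg2\n", corpus)
print([d.decision_id for d in violations])  # ['ADR-001']
```

Because the check is a pure function of the proposal and the corpus, the same inputs give the same result regardless of which tool or session produced the proposal.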

02 · Flagship demos

Three manifestations of the same governance problem

Each flagship is a category-level narrative: what fails without a governance layer, what holds with one, and where the evidence lives. They are designed to be cited, demoed, and walked through end-to-end — not skimmed.

Flagship 01 · Centerpiece · Runnable · Mneme dogfoods this

Governed Python agent — from bad code to compliant output

A Python coding agent proposes "from MnemeHQ.memory_store import MemoryStore". The proposal is locally reasonable, since MnemeHQ is the brand name the agent has seen everywhere, but architecturally invalid: it violates ADR-005 (Brand vs Package Namespace Enforcement). Mneme retrieves the compiled decision, blocks the violation, and injects the decision context; the agent then retries with "from mneme.memory_store import MemoryStore". This is the product loop in full.

EVIDENCE Live animation + full enforcement trace

RUNS python examples/demo-adr-import.py

SHOWS Violation detection, retry context injection, compliant output
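The loop described above (propose, check, inject context, retry) can be sketched with the standard library alone. Everything here is a simplified assumption for illustration: check_imports, toy_agent, and governed_generate are hypothetical names, the agent is scripted rather than a real model, and the ADR-005 rule is reduced to a single import-root check. It is not the code in examples/demo-adr-import.py.

```python
import ast

# Hypothetical stand-in for the compiled ADR-005 rule: the brand name
# "MnemeHQ" must not be used as a Python package root; the package is "mneme".
FORBIDDEN_ROOT = "MnemeHQ"
REQUIRED_ROOT = "mneme"


def check_imports(code: str) -> list[str]:
    """Return one message per import whose root package is the brand name."""
    messages = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        for module in modules:
            if module.split(".")[0] == FORBIDDEN_ROOT:
                messages.append(
                    f"ADR-005: import from '{module}' uses the brand name; "
                    f"use the '{REQUIRED_ROOT}' package instead"
                )
    return messages


def toy_agent(context: list[str]) -> str:
    """Scripted agent: proposes the bad import until it receives ADR context."""
    if any("ADR-005" in line for line in context):
        return "from mneme.memory_store import MemoryStore\n"
    return "from MnemeHQ.memory_store import MemoryStore\n"


def governed_generate(max_attempts: int = 3) -> tuple[str, int]:
    """Run propose -> check -> retry; return accepted code and attempt count."""
    context: list[str] = []
    for attempt in range(1, max_attempts + 1):
        proposal = toy_agent(context)
        violations = check_imports(proposal)
        if not violations:
            return proposal, attempt
        context.extend(violations)  # inject the decision into the next attempt
    raise RuntimeError("no compliant proposal within the attempt budget")


code, attempts = governed_generate()
print(attempts, code.strip())  # 2 from mneme.memory_store import MemoryStore
```

Parsing with ast instead of matching text means a comment or string that merely mentions MnemeHQ does not trigger the rule; only real import statements do.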

Walk through the product loop →

Try it on your own repo: pip install mneme && mneme init && mneme check or request a pilot →

Flagship 02 · Runnable

Architectural drift prevention

A six-step timeline. An agent proposes reasonable-looking code that violates ADR-001. Three downstream changes amplify the divergence. A human reviewer would plausibly miss it. Mneme detects the invariant violation upstream, emits an enforcement trace explaining why, and the agent retries within the constraints. The system converges instead of forking.

EVIDENCE Timeline visualization + enforcement trace

RUNS python examples/architectural-drift/run.py

SHOWS Drift propagation, upstream block, retry convergence
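To make the compounding effect concrete, here is a toy model of a shortened timeline: one violating change followed by three changes that build on it. The Change record, the change names, and divergent_changes are hypothetical and exist only for this sketch; the real demo in examples/architectural-drift/run.py is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    name: str
    violates_adr_001: bool = False
    builds_on: Optional[int] = None  # index of an earlier change, if any


# Hypothetical timeline: one violation followed by three dependent changes.
TIMELINE = [
    Change("add Postgres-backed store", violates_adr_001=True),
    Change("add connection pool", builds_on=0),
    Change("add migration script", builds_on=1),
    Change("add DB health check", builds_on=2),
]


def divergent_changes(timeline: list[Change], governed: bool) -> list[str]:
    """Names of the changes that end up off-architecture."""
    divergent: set[int] = set()
    for i, change in enumerate(timeline):
        if change.violates_adr_001:
            if governed:
                continue  # blocked upstream; the agent retries within ADR-001
            divergent.add(i)
        elif change.builds_on in divergent:
            divergent.add(i)  # inherits the divergence it was built on
    return [timeline[i].name for i in sorted(divergent)]


print(len(divergent_changes(TIMELINE, governed=False)))  # 4
print(len(divergent_changes(TIMELINE, governed=True)))   # 0
```

In this model a single upstream block removes all four off-architecture changes, because the downstream changes only diverge by inheriting from the first one.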

Walk through the timeline →

Flagship 03 · Forward-looking · Runnable

Governance continuity across multiple actors

Three agents act sequentially against the same codebase. Agent A introduces a divergence. Agent B builds on it. Agent C tries to remediate. Mneme evaluates the architectural invariants at every step. The point isn't multi-agent runtime sophistication — it's that the governance layer remains coherent across actors, sessions, and retries. As AI execution becomes distributed and persistent, governance becomes the coordination layer.

EVIDENCE Scripted three-actor trace + governance log

RUNS python examples/multi-agent-governance/run.py

SHOWS Invariant persistence under parallel/sequential actors
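A small sketch of the same idea: three scripted actors edit a shared codebase in sequence, and one invariant check is applied to every edit. The names violates_adr_005 and run_actors, the actor labels, and the log fields are assumptions for illustration; this is not the script in examples/multi-agent-governance/run.py.

```python
Codebase = dict[str, str]  # path -> source text


def violates_adr_005(codebase: Codebase) -> bool:
    """Hypothetical invariant: no file imports from the brand-name package."""
    return any("from MnemeHQ." in source for source in codebase.values())


def run_actors(codebase: Codebase, edits: list[tuple[str, Codebase]]):
    """Apply each actor's edit only if the shared invariant still holds."""
    log = []
    for actor, edit in edits:
        candidate = {**codebase, **edit}
        verdict = "FAIL" if violates_adr_005(candidate) else "PASS"
        if verdict == "PASS":
            codebase = candidate  # only compliant edits reach the codebase
        log.append({"actor": actor, "verdict": verdict, "decision_id": "ADR-005"})
    return codebase, log


EDITS = [
    ("agent-a", {"store.py": "from MnemeHQ.memory_store import MemoryStore\n"}),
    ("agent-b", {"cache.py": "from MnemeHQ.memory_store import MemoryStore\n"}),
    ("agent-c", {"store.py": "from mneme.memory_store import MemoryStore\n"}),
]

final_codebase, governance_log = run_actors({}, EDITS)
print([entry["verdict"] for entry in governance_log])  # ['FAIL', 'FAIL', 'PASS']
```

The invariant lives in one place and carries no per-actor state, which is why the verdicts stay consistent no matter which actor, session, or retry produced the edit.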

See the governance trace →

03 · Supporting enforcement examples

Is the enforcement real? Yes: here are the deterministic verdicts

If the flagships answer "why does this category exist?", the supporting examples answer "is the enforcement actually deterministic?" Each one is a single-violation walkthrough: a concrete ADR, the diff an agent would generate, and the exact mneme check verdict.

PASS ADR-001

Storage decision enforcement

JSON-only storage. The agent extends the existing module instead of proposing a Postgres migration.

WARN Approved-deps

Dependency policy enforcement

An unapproved dependency (sqlalchemy) is flagged with a structured WARN and a tracked override path.
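One way such a dependency check could work is sketched below: requirement lines are compared against an allowlist, and anything unapproved yields a structured WARN that records any tracked override. The allowlist contents, the override ticket format, and the finding fields are hypothetical; Mneme's real policy format is not shown in this page.

```python
import re
from typing import Optional

APPROVED = {"requests", "pydantic"}  # hypothetical allowlist


def requirement_name(line: str) -> Optional[str]:
    """Extract the package name from a requirements-style line."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", line)
    return match.group(1).lower() if match else None


def check_dependencies(requirements, approved, overrides):
    """Return one WARN finding per dependency that is not on the allowlist."""
    findings = []
    for raw in requirements:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name = requirement_name(line)
        if name is None or name in approved:
            continue
        findings.append({
            "verdict": "WARN",
            "rule": "approved-deps",
            "package": name,
            "override": overrides.get(name),  # None means no tracked override
        })
    return findings


findings = check_dependencies(
    ["requests>=2.31", "sqlalchemy==2.0.30  # ORM", ""],
    APPROVED,
    overrides={},
)
print(findings[0]["package"], findings[0]["verdict"])  # sqlalchemy WARN
```

Keeping the override inside the finding, rather than silently suppressing the warning, leaves an auditable record of who accepted the exception and why.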

FAIL ADR-004

Repository pattern enforcement

An ADR-004 violation in user.service.ts hard-fails mneme check in CI.

04 · Infrastructure features

How the governance corpus is built

The flagship demos work because ADR decisions are compiled into an executable corpus before generation happens. The feature below is the foundation layer — it turns existing architectural documentation into the structured decision store the demos run against.

Infrastructure feature · ADR compiler

ADR compiler — turn architectural decisions into infrastructure

The governed Python agent demo works because ADR-005 was compiled from docs/adr/ into project_memory.json. This page explains the compiler step: parse, validate, resolve precedence, emit. No vector store. No ML. Deterministic every run.
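The four steps named above (parse, validate, resolve precedence, emit) can be sketched as a small deterministic pipeline. The key: value ADR layout, the required fields, the supersedes rule, and the output shape are assumptions made for this example; they do not describe the real ADR format or the real contents of project_memory.json.

```python
import json

REQUIRED_FIELDS = ("id", "status", "rule")  # hypothetical minimum schema


def parse(text: str) -> dict:
    """Parse 'key: value' lines into a record (assumed ADR layout)."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip().lower()] = value.strip()
    return record


def validate(record: dict) -> dict:
    """Reject records that lack a required field."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"ADR is missing required fields: {missing}")
    return record


def resolve_precedence(records: list[dict]) -> list[dict]:
    """Keep accepted decisions that no other decision supersedes."""
    superseded = {r["supersedes"] for r in records if r.get("supersedes")}
    return [r for r in records
            if r["status"] == "accepted" and r["id"] not in superseded]


def compile_adrs(texts: list[str]) -> str:
    """Emit the active decisions as JSON, sorted so output is reproducible."""
    records = [validate(parse(text)) for text in texts]
    active = sorted(resolve_precedence(records), key=lambda r: r["id"])
    return json.dumps({"decisions": active}, sort_keys=True, indent=2)


ADR_002 = "id: ADR-002\nstatus: accepted\nrule: import from MnemeHQ"
ADR_005 = ("id: ADR-005\nstatus: accepted\nsupersedes: ADR-002\n"
           "rule: import from mneme, never from MnemeHQ")

compiled = compile_adrs([ADR_002, ADR_005])
print([d["id"] for d in json.loads(compiled)["decisions"]])  # ['ADR-005']
```

Sorting the decisions and the JSON keys is what makes the output byte-identical across runs and input orderings, which is the property "deterministic every run" depends on.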

See how ADRs become the governance corpus →

05 · Operational evidence

The benchmark, the integrations, the trace format

The flagships and supporting demos sit on top of three operational artifacts: a reproducible scenario benchmark, hook-level integrations with the tools teams already use, and a structured governance trace that drives the CI gate.

Benchmark

Governance Benchmark v1.1

Deterministic scenario suite, structured-output verification, pre-registered thresholds. 18 drift scenarios. 117 passing tests in v0.3.0.

Methodology →

Integrations

Claude Code, Cursor, GitHub Actions, ADR import

Pre-generation hooks for editors, post-generation enforcement in CI, and a corpus importer for ADRs that already exist in docs/adr/.

Integrations index →

Trace format

Governance violations reference

PASS / WARN / FAIL with decision IDs. The same structured trace gates pull requests, feeds dashboards, and drives the retry loop.
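A minimal sketch of how a trace like this could gate a pull request: each entry carries a decision ID and a verdict, the worst verdict wins, and only FAIL produces a non-zero exit code. The entry fields, the Verdict enum, and the choice to let WARN pass the gate are assumptions for illustration, not the documented trace schema.

```python
from enum import Enum


class Verdict(str, Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


# Higher number means more severe; used to pick the worst verdict in a trace.
SEVERITY = {Verdict.PASS: 0, Verdict.WARN: 1, Verdict.FAIL: 2}


def overall_verdict(trace: list[dict]) -> Verdict:
    """Return the most severe verdict in the trace (PASS if it is empty)."""
    verdicts = [Verdict(entry["verdict"]) for entry in trace]
    return max(verdicts, key=SEVERITY.__getitem__, default=Verdict.PASS)


def ci_exit_code(trace: list[dict]) -> int:
    """Assumed gate policy: FAIL blocks the build, WARN and PASS do not."""
    return 1 if overall_verdict(trace) is Verdict.FAIL else 0


# Entries mirror the three supporting examples; field names are hypothetical.
TRACE = [
    {"decision_id": "ADR-001", "verdict": "PASS"},
    {"decision_id": "approved-deps", "verdict": "WARN"},
    {"decision_id": "ADR-004", "verdict": "FAIL", "file": "user.service.ts"},
]

print(overall_verdict(TRACE).value, ci_exit_code(TRACE))  # FAIL 1
```

Because the trace is plain structured data, the same entries could be counted on a dashboard or fed back to an agent as retry context without re-running the checks.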

Reference →

Three commands to wire this up on your own repo

Open source, MIT. Same decision corpus drives the editor hook, the CI gate, and the ADR compiler.

pip install mneme && mneme init && mneme check

Want help wiring it up? Request a pilot — we'll compile your ADRs and walk the enforcement trace on your own repo.

FAQ

Common questions about the demo structure

Why three flagship demos instead of a single feature tour?

Each flagship demonstrates a different manifestation of the same structural problem. The governed Python agent demo shows the product outcome: AI proposes violating code, Mneme catches it, the agent retries with compliant output. Architectural drift prevention shows the failure mode over time. Multi-agent governance shows that constraints have to hold across actors, sessions, and retries. The ADR compiler is the infrastructure feature that makes all of them possible — it compiles existing ADRs into the executable corpus the flagships run against.

What is the difference between flagship and supporting demos?

Flagship demos answer "why does this category exist?" Supporting demos answer "is the enforcement real?" The supporting demos (storage decision, dependency policy, repository pattern) are short, deterministic enforcement examples a senior engineer can verify in 30 seconds. The flagships are systemic narratives showing where one-off enforcement compounds into infrastructure.

Are the runnable examples real or scripted?

The supporting demos are deterministic: same input, same Mneme verdict, every run. The governed Python agent flagship runs against Mneme's own ADRs and ships a Python walkthrough (examples/demo-adr-import.py in mneme-project-memory/) that imports, applies, and enforces end-to-end. The drift and multi-agent governance flagships ship lightweight reproducible scripts that simulate the orchestration; the enforcement and conflict-detection steps are the real Mneme pipeline. The point is to demonstrate governance coherence, not to claim a multi-agent runtime.

How is this different from CLAUDE.md or .cursor/rules?

CLAUDE.md and .cursor/rules are static text files the model is asked to respect. Mneme is a structured decision store with a precedence engine and hook-level enforcement, so compliance is not probabilistic. The full breakdown is in "Why prompt memory fails at scale"; the head-to-head comparison is in "Mneme vs Cursor Rules".

Where does the benchmark fit?

The benchmark is the evidence layer underneath the flagships. The flagships show why a governance layer is structurally necessary; the benchmark proves the enforcement is deterministic and reproducible across scenarios. See /benchmark/ for the methodology and the v1.1 scenario suite.