feat: Implement Deterministic Code-Level Guardrails for Agent Safety#7800

Open
Saurav-Gupta-13 wants to merge 1 commit into
microsoft:mainfrom
Saurav-Gupta-13:feature/deterministic-guardrails

Conversation

@Saurav-Gupta-13

Resolves #7770

The Problem

As reported in #7770, safety rules enforced only through prompts and system instructions are unreliable. LLMs are subject to context-window degradation and jailbreaks, so an agent can bypass prompt-level rules and execute destructive commands, which can result in serious infrastructure loss.

The Solution

This PR introduces a deterministic Code-Based Governance architecture.

  1. GuardrailInterceptor: A middleware hook injected directly into _code_executor_agent.py that inspects each command's AST and matches it against regex patterns before it is executed by the OS.
  2. Blast-Radius Protection: If an agent attempts multiple destructive commands in a single turn, the entire code block is rejected immediately.
  3. Mandatory Dry-Runs: Destructive actions (rm -rf, terraform, aws) suspend the execution thread until a human types a CONFIRM token in the terminal.
  4. Persistent SQLite State: If an agent violates safety rules, its GuardrailState is updated to RESTRICTED in a persistent SQLite database. Even if the LLM's memory is wiped or the agent is restarted, it remains locked down until a human resets it.
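
For reviewers, here is a minimal sketch of how checks 1 to 3 could fit together. It is not the code in this PR: `is_destructive`, `check_block`, `Verdict`, and the `DESTRUCTIVE_PROGRAMS` set are hypothetical names chosen for illustration, and it tokenizes commands with `shlex` instead of the AST and regex matching the interceptor uses.

```python
import shlex
from dataclasses import dataclass
from typing import Callable, List

# Assumption: programs that are always treated as destructive (illustrative only).
DESTRUCTIVE_PROGRAMS = {"terraform", "aws"}


def is_destructive(command: str) -> bool:
    """Classify one shell command line as destructive or not."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unparseable input (e.g. an unclosed quote) is treated as unsafe.
        return True
    if not tokens:
        return False
    program, args = tokens[0], tokens[1:]
    if program in DESTRUCTIVE_PROGRAMS:
        return True
    if program == "rm":
        # Collect short flags such as -rf or -r -f (long options are ignored here).
        flags = "".join(
            a[1:] for a in args if a.startswith("-") and not a.startswith("--")
        )
        return "r" in flags and "f" in flags
    return False


@dataclass
class Verdict:
    allowed: bool
    reason: str


def check_block(
    commands: List[str], confirm: Callable[[str], str] = input
) -> Verdict:
    """Apply the blast-radius rule, then require a human CONFIRM token."""
    destructive = [c for c in commands if is_destructive(c)]
    if len(destructive) > 1:
        return Verdict(False, "blast radius: multiple destructive commands in one turn")
    if len(destructive) == 1:
        answer = confirm(f"About to run {destructive[0]!r}. Type CONFIRM to proceed: ")
        if answer.strip() != "CONFIRM":
            return Verdict(False, "human confirmation not given")
    return Verdict(True, "ok")
```

Passing `confirm` as a parameter keeps the terminal prompt out of the decision logic, so the same check can be exercised in unit tests with a stub instead of a real terminal.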

This completely isolates the "Brain" from the "Hands" using a zero-trust model.
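
The persistent lockout could be sketched roughly as follows. The table name, column names, the `GuardrailStateStore` class, and the `ACTIVE` default state are assumptions made for illustration; only the RESTRICTED state and the use of SQLite come from the description above, and the schema in this PR may differ.

```python
import sqlite3

# RESTRICTED comes from the PR description; ACTIVE is an assumed default state.
ACTIVE, RESTRICTED = "ACTIVE", "RESTRICTED"


class GuardrailStateStore:
    """Keeps each agent's guardrail state in a SQLite file so it survives restarts."""

    def __init__(self, path: str = "guardrail_state.db") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS guardrail_state ("
            "agent_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self._conn.commit()

    def get(self, agent_id: str) -> str:
        row = self._conn.execute(
            "SELECT state FROM guardrail_state WHERE agent_id = ?", (agent_id,)
        ).fetchone()
        # Agents with no recorded violation default to ACTIVE.
        return row[0] if row else ACTIVE

    def _set(self, agent_id: str, state: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO guardrail_state (agent_id, state) VALUES (?, ?)",
            (agent_id, state),
        )
        self._conn.commit()

    def restrict(self, agent_id: str) -> None:
        """Called by the interceptor when an agent violates a safety rule."""
        self._set(agent_id, RESTRICTED)

    def reset_by_human(self, agent_id: str) -> None:
        """Only a human operator path should call this; the agent never does."""
        self._set(agent_id, ACTIVE)

    def close(self) -> None:
        self._conn.close()
```

Because the state lives in a file rather than in the conversation, a fresh process that opens the same path sees RESTRICTED for a locked agent regardless of what the LLM remembers.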

@Saurav-Gupta-13

Author

Hi team, a quick ping: this PR introduces a structural fix for the safety vulnerabilities outlined in #7770. I have tested the blast-radius interceptor and the persistent SQLite state locally, and together they block destructive commands across context resets. Please let me know if you would like me to adjust the SQLite schema or the regex patterns during review.

@Saurav-Gupta-13

Author

@microsoft-github-policy-service agree

