feat: Implement Deterministic Code-Level Guardrails for Agent Safety #7800
Open
Saurav-Gupta-13 wants to merge 1 commit into
Author
Hi team, just a ping: this PR introduces a structural fix for the safety vulnerabilities outlined in #7770. I have tested the blast-radius interceptor and the SQLite persistent state locally, and together they successfully block destructive commands across context resets. Please let me know if you want me to adjust the SQLite schema or the regex patterns during review!
Author
@microsoft-github-policy-service agree
Resolves #7770
The Problem
As reported in #7770, prompt-based safety and system instructions are fundamentally broken. LLMs suffer from context window degradation and are susceptible to jailbreaks, which allows them to bypass prompt rules and execute destructive commands (resulting in massive infrastructure losses).
The Solution
This PR introduces a deterministic Code-Based Governance architecture.
- A blast-radius interceptor in `_code_executor_agent.py` that parses the AST and regex footprint of commands before OS execution.
- Destructive commands (`rm -rf`, `terraform`, `aws`) suspend the execution thread and require a strict human terminal `CONFIRM` token.
- `GuardrailState` is updated to `RESTRICTED` in a persistent SQLite database. Even if the LLM's memory is wiped or the agent is restarted, it remains locked down until a human resets it.

This completely isolates the "Brain" from the "Hands" using a zero-trust model.
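To make the flow concrete for reviewers, here is a minimal, standalone sketch of the three pieces described above. It is not the code in this PR. The names `HIGH_RISK`, `GuardrailStore`, `guarded_execute`, the `guardrail_state` table and the `NORMAL` state are all hypothetical. The sketch only does regex matching and omits the AST step. It also assumes that a declined confirmation is what moves the state to `RESTRICTED`, since the exact trigger is not spelled out here.

```python
import re
import sqlite3

# Hypothetical pattern for illustration; the PR's real regexes may differ.
HIGH_RISK = re.compile(r"\b(rm\s+-(?:rf|fr)|terraform|aws)\b")


class GuardrailStore:
    """Single-row SQLite table holding the guardrail state (hypothetical schema)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS guardrail_state ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), state TEXT NOT NULL)"
        )
        # Seed the row only on first use, so an existing RESTRICTED state survives restarts.
        self.conn.execute(
            "INSERT OR IGNORE INTO guardrail_state (id, state) VALUES (1, 'NORMAL')"
        )
        self.conn.commit()

    def get(self):
        row = self.conn.execute(
            "SELECT state FROM guardrail_state WHERE id = 1"
        ).fetchone()
        return row[0]

    def set(self, state):
        self.conn.execute(
            "UPDATE guardrail_state SET state = ? WHERE id = 1", (state,)
        )
        self.conn.commit()


def guarded_execute(command, store, run, ask=input):
    """Return run(command) if the command is permitted, otherwise None."""
    if store.get() == "RESTRICTED":
        # Locked down until a human calls store.set("NORMAL").
        return None
    if HIGH_RISK.search(command):
        prompt = f"High-risk command {command!r}. Type CONFIRM to proceed: "
        if ask(prompt).strip() != "CONFIRM":
            # Assumption: a declined confirmation triggers the lockdown.
            store.set("RESTRICTED")
            return None
    return run(command)
```

In an actual integration, `run` would be whatever the executor uses to run the command, and `ask` would read from the human's terminal. Both are passed in here so the gate can be exercised without touching the OS. Because the state lives in the SQLite file rather than in the agent's context, reopening the store after a restart still returns `RESTRICTED`.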