Autonomous Multi-Agent RAG Fixing with Human-in-the-Loop Safety
HEAL is a research-driven, autonomous multi-agent system that diagnoses and fixes RAG (Retrieval-Augmented Generation) issues at scale. It extracts quality test cases from JIRA tickets, discovers patterns across failures, and generates fixes with evaluation-driven iteration, all with human oversight at critical decision points.
What makes HEAL different: We don't just build; we measure, learn, and pivot based on data. When experiments show our assumptions are wrong (like optimizing for URL F1), we change direction mid-course.
Manual RAG debugging doesn't scale:
- RHEL Lightspeed had 68 JIRA tickets for incorrect answers
- Manual extraction: 21% success rate (hallucinations, no verification)
- Manual fixing: 2-4 hours per ticket, requires SME expertise
- No systematic way to find patterns across similar failures
- Even successful extractions used WRONG docs (reinstall vs update)
Autonomous multi-agent pipeline with research-driven optimization:
graph TB
A[JIRA Tickets] --> B[Scope Check]
B --> C[Multi-Agent Extraction]
C --> D[URL Validation]
D --> E[Quality-Verified Q&A]
E --> F[Pattern Discovery]
F --> G[Evaluation-Driven Fixes]
G --> H[Interactive Review]
H --> I[Jira Updates + PR Creation]
J[Retrieval Research] -.->|Optimize| C
J -.->|87.4% Content Relevance| D
K[Comparison Framework] -->|Data-Driven Pivots| J
style J fill:#e1f5ff
style K fill:#e1f5ff
Results:
- 100% extraction success (vs 21% manual)
- 60-100x faster than manual approach
- 87.4% content relevance with RAG-enhanced retrieval (vs 63.3% baseline)
- URL validation catches wrong docs before synthesis
- Interactive review gives human approval before commits
- Jira automation with dry-run preview mode
- Research-driven optimization with data-driven pivots
- Python 3.11+
- uv package manager (install instructions)
- Google Cloud SDK (gcloud) for authentication
- Access to Anthropic Claude via Vertex AI
# 1. Clone repository
git clone <HEAL-repo-url>
cd HEAL
# 2. Install dependencies
uv sync --extra dev
HEAL uses Claude via Vertex AI. Choose one authentication method:
Option 1: Application Default Credentials (recommended for development)
gcloud auth application-default login
Option 2: Service Account (recommended for CI/production)
# Set path to service account key in .env:
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Copy example config
cp .env.example .env
# Edit .env and set your project ID:
# ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id
HEAL auto-detects related repositories if placed adjacently:
parent-directory/
├── HEAL/ # This repo
├── okp-mcp/ # Auto-detected
├── lscore-deploy/ # Auto-detected
└── lightspeed-evaluation/ # Auto-detected
OR set paths explicitly in .env:
OKP_MCP_ROOT=/path/to/okp-mcp
LSCORE_DEPLOY_ROOT=/path/to/lscore-deploy
LIGHTSPEED_EVAL_ROOT=/path/to/lightspeed-evaluation
# Solr URL (defaults to localhost:8983)
SOLR_URL=http://localhost:8983/solr/portal
# Custom log/worktree directories (defaults to ~/.heal/)
HEAL_LOG_DIR=/custom/path/logs
HEAL_WORKTREE_ROOT=/custom/path/worktrees
# Check configuration
uv run python -c "from heal.core.config import HEALConfig; HEALConfig.print_config_summary()"
# Verify imports work
uv run python -c "from heal.agents import LinuxExpertAgent; print('✅ HEAL ready!')"
Expected output:
HEAL Configuration:
OKP-MCP root: /path/to/okp-mcp
lscore-deploy root: /path/to/lscore-deploy
lightspeed-eval root: /path/to/lightspeed-evaluation
Solr URL: http://localhost:8983/solr/portal
Log directory: /home/user/.heal/logs
Worktree directory: /home/user/.heal/worktrees
Environment Validation:
✅ okp_mcp_found
✅ log_dir_writable
✅ worktree_dir_writable
✅ solr_url_valid
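The lookup order described earlier (an explicit path in `.env` wins, otherwise a directory adjacent to HEAL is used) can be sketched as follows. This is a minimal illustration, not HEAL's actual `HEALConfig` code: the function name `resolve_repo_root` and its parameters are hypothetical, and only the environment variable names and directory names come from this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_repo_root(
    env_var: str,
    dir_name: str,
    heal_root: Path,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the explicit path from the environment if set, else an adjacent directory.

    Hypothetical helper: mirrors the documented behavior, not HEAL's real implementation.
    """
    env = os.environ if env is None else env
    explicit = env.get(env_var)
    if explicit:
        # An explicit setting such as OKP_MCP_ROOT always takes precedence.
        return Path(explicit).expanduser()
    # Fall back to a sibling of the HEAL checkout, e.g. parent-directory/okp-mcp.
    sibling = heal_root.parent / dir_name
    return sibling if sibling.is_dir() else None
```

For example, `resolve_repo_root("OKP_MCP_ROOT", "okp-mcp", Path.cwd())` returns `None` when neither the variable nor the sibling directory exists, which is the situation the "repository not found" error addresses.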
# Interactive demo (10 tickets, ~5-10 minutes)
./scripts/demo_heal_workflow.sh --quick
# Full demo (68 tickets, ~45-60 minutes)
./scripts/demo_heal_workflow.sh
Error: OKP-MCP repository not found
Place the okp-mcp repository adjacent to HEAL:
cd ..
git clone <okp-mcp-repo-url>
Or set environment variable:
export OKP_MCP_ROOT=/path/to/okp-mcp
Error: Solr is not accessible
If Solr is running on a different host/port:
export SOLR_URL=http://your-solr-host:8983/solr/portal
For Docker:
export SOLR_URL=http://host.docker.internal:8983/solr/portal
Error: Claude authentication failed
Verify ADC is set up correctly:
gcloud auth application-default login
gcloud config set project your-gcp-project-id
Check credentials file exists:
ls -la ~/.config/gcloud/application_default_credentials.json
Debug Logs
HEAL writes debug logs to ~/.heal/logs/:
- solr_multi_agent_debug.log - Multi-agent system calls
- claude_sdk_debug.log - Claude SDK interactions
Custom log location:
export HEAL_LOG_DIR=/path/to/logs
HEAL doesn't just build; it measures, learns, and pivots based on data.
Can cheap baseline retrieval (RAG-enhanced edismax) replace expensive multi-agent URL validation?
Hypothesis:
- Need LLM-based URL validation ($0.01-0.05 per query)
- Multi-agent validation necessary for quality
Experiment Design:
# Compare 3 retrieval strategies on BOOTLOADER pattern
uv run python scripts/compare_okp_vs_baseline.py \
--pattern BOOTLOADER_GRUB_ISSUES --details
Measured:
- URL F1 (exact URL matches) - traditional metric
- Content Relevance (semantic keyword overlap) - cheap heuristic
- Answer quality spot-checks - ground truth
| Strategy | URL F1 | Content Relevance | Cost |
|---|---|---|---|
| Simple (baseline) | 4.4% | 63.3% | $0 |
| RAG (edismax) | 6.7% | 87.4% | $0 |
| okp-mcp (validation) | TBD | TBD | $0.01-0.05/query |
Key Discovery: The "Expected URLs Problem"
- RAG achieved low URL F1 (6.7%) but high content relevance (87.4%)
- Retrieved different but semantically correct docs
- Expected URLs aren't exhaustive: many valid answers exist!
Pivot Decision:
- Don't optimize for exact URL matches (wrong metric)
- DO optimize for content relevance + answer quality
- Can replace expensive validation with cheap heuristic
- Save $3-15 per pattern while maintaining quality
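To make the "Expected URLs Problem" concrete, the sketch below computes both kinds of metric for one query. It is an assumption-laden illustration: `url_f1` is the standard set-based F1, while `keyword_overlap` is only a stand-in for the "semantic keyword overlap" heuristic named above, since the README does not spell out how Content Relevance is computed. The URLs and text in the usage note are made up.

```python
import re
from typing import Iterable, Set


def url_f1(retrieved: Iterable[str], expected: Iterable[str]) -> float:
    """F1 over exact URL matches between retrieved and expected sets."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved_set)
    recall = hits / len(expected_set)
    return 2 * precision * recall / (precision + recall)


def keyword_overlap(query: str, doc_text: str) -> float:
    """Fraction of query keywords that appear in the document text.

    Assumed stand-in for Content Relevance, not HEAL's actual scoring.
    """
    def tokens(text: str) -> Set[str]:
        # Lowercase alphanumeric tokens, ignoring very short words.
        return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 2}

    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(doc_text)) / len(query_terms)
```

A retrieved document at a URL that is not in the expected list scores `url_f1(...) == 0.0`, yet `keyword_overlap("update grub bootloader", "How to update the GRUB bootloader configuration")` is `1.0`. That is the pattern behind low URL F1 with high content relevance.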
# Solr edismax with optimized field boosting
params = {
"defType": "edismax",
"qf": "title^3.0 content^1.0 main_content^1.5 id^2.0",
"pf": "title^10.0 content^5.0 main_content^7.0", # Phrase boosting
"ps": "2", # Phrase slop
"mm": "50%", # Minimum match
}
# Result: 87.4% content relevance (vs 63.3% baseline)
# Test RAG-enhanced extraction on sample tickets
uv run python src/heal/bootstrap/extract_jira_tickets_rag.py --limit 3
# Compare quality: baseline vs RAG
uv run python scripts/compare_extracted_yamls.py --details
Metrics compared:
- Answer length (more detailed?)
- URLs retrieved (different docs?)
- Refinement iterations (better first-pass quality?)
- Review scores (higher quality?)
See: docs/RAG_EXTRACTION_TESTING.md for complete testing guide
Goal: Convert JIRA tickets into quality-verified Q&A pairs with source URLs
# Extract from JQL query
uv run python src/heal/bootstrap/extract_jira_tickets.py \
--jql "project = RSPEED AND labels = cla-incorrect-answer AND resolution = Unresolved" \
--output config/extracted_tickets.yaml
# Or extract specific tickets
uv run python src/heal/bootstrap/extract_jira_tickets.py \
--tickets RSPEED-2651,RSPEED-2652,RSPEED-2653
# Force re-extract (update existing tickets)
uv run python src/heal/bootstrap/extract_jira_tickets.py \
--tickets RSPEED-2651 \
--force-reextract
What happens:
- Scope Check filters meta-tickets, jailbreaks, non-RHEL questions (38% noise filtered)
- Linux Expert forms hypothesis about correct answer
- Solr Expert searches RHEL documentation for verification
- URL Validation Agent (NEW): validates docs BEFORE synthesis
  - Catches wrong docs early (e.g., "reinstall" vs "update")
  - Retries search with better queries if validation fails
- Linux Expert synthesizes answer from VALIDATED docs
- Answer Review Agent checks quality (iterates up to 3x until score ≥ 0.7)
Output: config/extracted_tickets.yaml - 100% success on valid RHEL tickets
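The review step at the end of the pipeline (up to 3 iterations, stopping once the score reaches 0.7) amounts to a bounded refine loop. The sketch below shows one way to structure it. The `review` and `refine` callables are placeholders for the Answer Review Agent and Linux Expert, and the function name and return shape are assumptions rather than HEAL's API.

```python
from typing import Callable, Tuple


def refine_until_accepted(
    draft: str,
    review: Callable[[str], Tuple[float, str]],
    refine: Callable[[str, str], str],
    threshold: float = 0.7,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Review the draft, refining on feedback, until the score reaches the threshold.

    Returns (answer, last_score, reviews_performed). Hypothetical helper.
    """
    answer = draft
    score = 0.0
    for iteration in range(1, max_iterations + 1):
        score, feedback = review(answer)
        if score >= threshold:
            return answer, score, iteration
        if iteration < max_iterations:
            # Only refine when another review will follow, so the returned
            # answer is always the one the returned score refers to.
            answer = refine(answer, feedback)
    return answer, score, max_iterations
```

Passing the reviewer and refiner as callables keeps the loop testable without any LLM calls. A caller would treat a final score below the threshold as a failed extraction.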
NEW: In-place URL validation - no full re-extraction needed!
# Read-only validation (just report issues)
uv run python scripts/validate_yaml_urls.py --pattern BOOTLOADER_GRUB_ISSUES
# Auto-fix: search for better URLs (dry-run first)
uv run python scripts/validate_yaml_urls.py \
--pattern BOOTLOADER_GRUB_ISSUES \
--auto-fix \
--dry-run
# Apply fixes (creates .yaml.bak backup)
uv run python scripts/validate_yaml_urls.py \
--pattern BOOTLOADER_GRUB_ISSUES \
--auto-fix
What it does:
- Searches Solr with each ticket's query
- Validates retrieved docs actually answer the question
- Updates expected_urls in the pattern YAML if better URLs are found
- Saves changes in place, with a backup
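The "save in place with a backup" behavior, together with `--dry-run`, can be sketched at the file level as below. This is an illustrative helper, not the code in `validate_yaml_urls.py`: the name `write_with_backup` is hypothetical, and the YAML editing itself is left out so the example needs only the standard library.

```python
import shutil
from pathlib import Path


def write_with_backup(path: Path, new_text: str, dry_run: bool = False) -> bool:
    """Replace a file's contents in place, keeping the old version as <name>.bak.

    Returns True if the file changed (or would change in dry-run mode).
    """
    old_text = path.read_text(encoding="utf-8")
    if old_text == new_text:
        return False  # Nothing to do, so no backup is created.
    if dry_run:
        return True  # Report the pending change without touching the file.
    backup = path.with_name(path.name + ".bak")  # pattern.yaml -> pattern.yaml.bak
    shutil.copy2(path, backup)
    path.write_text(new_text, encoding="utf-8")
    return True
```

Copying before writing means a failed or unwanted update can be undone by restoring the `.yaml.bak` file.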
# Analyze tickets to find common failure patterns
uv run python src/heal/pattern_discovery/discover_ticket_patterns.py \
--input config/extracted_tickets.yaml \
--output-tagged config/tickets_with_patterns.yaml \
--output-report config/patterns_report.json \
--min-pattern-size 3
Output: Pattern groups with ≥3 similar tickets (e.g., BOOTLOADER_GRUB_ISSUES, RPM_OSTREE_COMMANDS)
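The effect of `--min-pattern-size` can be illustrated with a small grouping step. This sketch covers only the size filter applied after tickets have been tagged. The clustering itself is LLM-based and not shown, and the function name and ticket IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_pattern(
    tagged: Iterable[Tuple[str, str]], min_pattern_size: int = 3
) -> Dict[str, List[str]]:
    """Group (ticket_id, pattern_id) pairs and keep patterns with enough tickets."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ticket_id, pattern_id in tagged:
        groups[pattern_id].append(ticket_id)
    # Drop patterns that are too small to justify a batch fix.
    return {p: ids for p, ids in groups.items() if len(ids) >= min_pattern_size}


# Hypothetical ticket IDs, for illustration only.
tagged = [
    ("T-1", "BOOTLOADER_GRUB_ISSUES"),
    ("T-2", "BOOTLOADER_GRUB_ISSUES"),
    ("T-3", "BOOTLOADER_GRUB_ISSUES"),
    ("T-4", "RPM_OSTREE_COMMANDS"),
]
patterns = group_by_pattern(tagged)
```

Here `patterns` contains only `BOOTLOADER_GRUB_ISSUES`, because the single `RPM_OSTREE_COMMANDS` ticket falls below the threshold of 3.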
# Generate one YAML per pattern for lightspeed-evaluation
uv run python src/heal/bootstrap/convert_bootstrap_to_eval_format.py \
--tickets config/extracted_tickets.yaml \
--patterns config/patterns_report.json \
--output-dir config/patterns/
Output: config/patterns/{PATTERN_ID}.yaml - ready for evaluation
Goal: Fix retrieval/ranking issues with evaluation-driven iteration and human oversight
# Run pattern fix with interactive review
./runners/fix.sh BOOTLOADER_GRUB_ISSUES
What happens:
- Baseline evaluation identifies the problem (low URL F1, poor answer quality)
- Multi-agent diagnosis (Solr Expert + Code Expert) proposes fix
- Human approval #1: Review reasoning, approve/reject change
- Change applied → file modified
- Git diff shown
- Human approval #2: Review diff, approve/reject
  - If rejected → git restore (instant revert)
  - If approved → runs test
- Test passes → commits change
- Evaluation checks if metrics improved
- Iterate until stable or max iterations reached
Safety features:
- Two approval checkpoints
- Easy rollback with 'n'
- Test-before-commit
- Git isolation (fix branch, never auto-merges)
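The two-checkpoint flow and its safety properties can be sketched as a small state machine. Everything here is illustrative: `FixHooks`, `run_fix_iteration`, and the returned status strings are hypothetical names. Reverting on a failed test is an assumption, since the README only states that a passing test leads to a commit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FixHooks:
    """Injected side effects, so the flow can be exercised without git or an LLM."""
    approve: Callable[[str], bool]   # shows text to the reviewer, True if approved
    apply_change: Callable[[], str]  # modifies the file and returns the diff
    revert: Callable[[], None]       # e.g. a wrapper around `git restore`
    run_tests: Callable[[], bool]
    commit: Callable[[], None]


def run_fix_iteration(reasoning: str, hooks: FixHooks, yolo: bool = False) -> str:
    """Run one proposed fix through both approval checkpoints."""
    approve = (lambda _text: True) if yolo else hooks.approve
    if not approve(reasoning):       # Checkpoint 1: review the reasoning.
        return "rejected-reasoning"
    diff = hooks.apply_change()
    if not approve(diff):            # Checkpoint 2: review the diff.
        hooks.revert()
        return "rejected-diff"
    if not hooks.run_tests():        # Test before commit (revert is an assumption).
        hooks.revert()
        return "tests-failed"
    hooks.commit()
    return "committed"
```

With `yolo=True` both prompts are skipped, but the test gate still runs before anything is committed.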
# Skip interactive prompts (for automation)
./runners/fix.sh BOOTLOADER_GRUB_ISSUES --yolo
# See what WOULD be posted to Jira (doesn't actually post)
./runners/fix.sh BOOTLOADER_GRUB_ISSUES --dry-run-integrations
# Review the preview
cat .diagnostics/BOOTLOADER_GRUB_ISSUES/JIRA_COMMENTS_PREVIEW.md
Preview includes:
- Metrics (before/after comparison)
- Model reasoning for the fix
- Warnings (high variance, RAG quality issues)
- Code changes summary
- Next steps for reviewers
# Actually post to Jira and create PR
./runners/fix.sh BOOTLOADER_GRUB_ISSUES --enable-jira --create-pr
What happens:
- Posts comprehensive comment to each ticket in pattern
- Pushes fix branch to remote
- Creates PR with metrics, reasoning, testing checklist
- Leaves you on fix branch (ready to review/merge)
Linux Expert: 15+ years of RHEL expertise encoded as agent behavior
- Forms hypotheses about correct answers
- Synthesizes verified responses from documentation
- Refines answers based on Review Agent feedback
- Uses Claude Sonnet 4.5 via Vertex AI
Solr Expert: searches RHEL documentation (OKP) for fact verification
- Queries official RHEL knowledge portal
- Returns clean docs + source URLs
- Builds search intelligence database
- Provides confidence scoring
RAG-Enhanced Variant (proven 87.4% content relevance):
- Optimized edismax with field boosting (title^3.0, main_content^1.5)
- Phrase field boosting for better matching
- Phrase slop (ps=2) so phrase matches tolerate small gaps or reordering
- Minimum match (mm=50%) to reduce false positives
- Drop-in replacement for testing better retrieval
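As a rough illustration of how these edismax settings reach Solr, the sketch below builds a request URL for the standard `/select` handler using the parameters listed in this README. The helper name `build_select_url`, the `rows` default, and the sample query are assumptions. HEAL's Solr Expert may issue its queries differently.

```python
from urllib.parse import urlencode

# Parameter values copied from the edismax configuration in this README.
EDISMAX_PARAMS = {
    "defType": "edismax",
    "qf": "title^3.0 content^1.0 main_content^1.5 id^2.0",  # field boosts
    "pf": "title^10.0 content^5.0 main_content^7.0",        # phrase boosts
    "ps": "2",                                              # phrase slop
    "mm": "50%",                                            # minimum match
}


def build_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a Solr /select URL for a user query (hypothetical helper)."""
    params = {"q": query, "rows": str(rows), "wt": "json", **EDISMAX_PARAMS}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr/portal", "grub2-mkconfig after kernel update"
)
```

`urlencode` percent-encodes the `^` and `%` characters, so the boosts and the `mm` value survive the round trip to Solr unchanged.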
URL Validation Agent: validates docs BEFORE synthesis
- Prevents synthesis from wrong docs
- Catches semantic mismatches (e.g., "update" vs "reinstall")
- Retries search with better queries if validation fails
- Reduces answer refinement cycles by ~30%
Answer Review Agent: quality gatekeeper for production-ready answers
- Scores answers 0.0-1.0 (must score ≥ 0.7)
- Checks against production guidelines:
  - Conciseness (no verbose explanations)
  - No "based on documentation" phrases
  - Complete commands with all parameters
  - Proper markdown formatting
- Provides suggested fixes for common issues
- Enables autonomous quality loop
Pattern Discovery: finds common themes across failures
- LLM-based clustering (Claude Sonnet 4)
- Groups similar failures (≥3 tickets per pattern)
- Auto-filters OUT_OF_SCOPE tickets
- Enables batch fixing (10-15 tickets per fix)
Fix loop: evaluation-driven optimization with human oversight
- Baseline → diagnose → fix → test → iterate
- Interactive review at two checkpoints
- YOLO mode for automation
- Multi-agent collaboration (Solr + Code experts)
- Commits only if tests pass
config/pattern_fix_config.yaml - Main configuration
eval_root: /path/to/lightspeed-evaluation
okp_mcp_root: /path/to/okp-mcp
lscore_deploy_root: /path/to/lscore-deploy
patterns_dir: config/patterns
max_iterations: 10
stability_runs: 3
validation_cycles: 3 # Outer loop with full answer validation
# Interactive review (can be overridden with --yolo flag)
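interactive: true
Required:
- ANTHROPIC_VERTEX_PROJECT_ID - Your GCP project ID for the Claude API
- GOOGLE_APPLICATION_CREDENTIALS - GCP credentials (path to a service account key; with Application Default Credentials, authenticate via gcloud auth application-default login instead)
Optional:
- API_KEY - For RHEL Lightspeed API (if testing against live API)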
scripts/validate_yaml_urls.py [OPTIONS]
--pattern PATTERN_ID # Specific pattern to validate
--auto-fix # Search for better URLs if validation fails
--dry-run # Preview changes without saving
./runners/fix.sh PATTERN_ID [OPTIONS]
Interactive Review (DEFAULT: ON):
--yolo # Auto-approve all changes (skip prompts)
Jira Integration (DEFAULT: OFF):
--enable-jira # Post comments to Jira tickets
--dry-run-integrations # Preview Jira/PR without executing
PR Creation (DEFAULT: OFF):
--create-pr # Create GitHub PR after successful fix
Testing:
--mode single # Test one ticket per pattern
--mode full # Test all tickets in pattern
--max-iterations N # Max Solr optimization iterations
--validation-cycles N # Outer loop cycles with answer validation
--include-judge-reasoning # A/B test: include LLM judge critique
# Compare retrieval strategies
scripts/compare_okp_vs_baseline.py [OPTIONS]
--pattern PATTERN_ID # Pattern to test (default: BOOTLOADER_GRUB_ISSUES)
--details # Show per-query iteration details
# Test RAG extraction
src/heal/bootstrap/extract_jira_tickets_rag.py [OPTIONS]
--limit N # Number of tickets to extract
--force-rebuild # Start fresh (ignore existing YAML)
--tickets TICKET_IDS # Specific tickets to test
# Compare YAML quality
scripts/compare_extracted_yamls.py [OPTIONS]
--details # Show per-ticket comparison
--ticket TICKET_ID # Deep-dive on single ticket
| Metric | Before HEAL | After HEAL | Improvement |
|---|---|---|---|
| Extraction Success | 21% | 100% | 4.8x |
| Time to Extract | 2-4 hours (manual) | 10-15 min (autonomous) | 10-20x faster |
| Answer Quality | Unverified | Score ≥ 0.7 (validated) | Production-ready |
| Source Traceability | None | Every answer has URLs | Auditable |
| Scope Detection | Manual triage | Auto-filters 38% noise | Intelligent |
| Security | Vulnerable to jailbreaks | Auto-blocks attacks | Protected |
| URL Accuracy | Unknown | Validated before synthesis | Verified |
- 42 RHEL tickets extracted (100% success)
- 26 out-of-scope tickets filtered (38% noise reduction)
  - 8 jailbreak attempts blocked
  - 18 meta-tickets about CLA behavior
- Total time: 1-1.5 hours vs 100+ hours manual
- URL validation: ~30% reduction in answer refinement cycles
- RAG-enhanced edismax: 87.4% content relevance (vs 63.3% baseline)
- Discovery: URL F1 is the wrong metric; optimize for semantic relevance instead
- Cost savings: Can replace $0.01-0.05/query validation with $0 heuristic
- Research-driven: Data shows when to pivot strategy mid-experiment
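The headline figures can be cross-checked against each other with a few lines of arithmetic. The inputs below are taken from this README (68 tickets, 8 jailbreak attempts, 18 meta-tickets, 21% manual success, 1-1.5 hours vs 100+ hours). Treating "100+ hours" as exactly 100 is an assumption, so the speedup is a lower bound.

```python
total_tickets = 68
filtered = 8 + 18                    # jailbreak attempts + meta-tickets about CLA behavior
extracted = total_tickets - filtered

noise_share = filtered / total_tickets            # share of tickets filtered as noise
success_ratio = 100 / 21                          # 100% vs 21% extraction success
speedup_low, speedup_high = 100 / 1.5, 100 / 1.0  # manual hours / HEAL hours

print(
    f"extracted={extracted}, noise={noise_share:.0%}, "
    f"success x{success_ratio:.1f}, speedup {speedup_low:.0f}-{speedup_high:.0f}x"
)
# prints: extracted=42, noise=38%, success x4.8, speedup 67-100x
```

The results line up with the 42 extracted tickets, the 38% noise figure, the 4.8x improvement in the table, and the 60-100x speed claim in the results summary.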
Scope Check (Pre-Flight Filter):
- Detects meta-tickets about AI behavior
- Blocks jailbreak attempts and prompt injection
- Filters non-RHEL questions (Windows, Ubuntu, etc.)
- Runs BEFORE expensive LLM calls
- Result: 8 jailbreak attempts blocked (none succeeded)
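The README does not describe how the scope check detects these cases, so the sketch below only illustrates the idea of a cheap pre-flight gate that runs before any LLM call. The marker phrases, the `Scope` categories, and the `scope_check` function are all invented for illustration. Plain substring matching like this would misclassify legitimate questions, such as mounting a Windows share from RHEL.

```python
from enum import Enum
from typing import Tuple


class Scope(Enum):
    IN_SCOPE = "in_scope"
    JAILBREAK = "jailbreak"
    NON_RHEL = "non_rhel"


# Illustrative marker phrases only; a real filter needs far more care.
JAILBREAK_MARKERS: Tuple[str, ...] = ("ignore previous instructions", "system prompt")
NON_RHEL_MARKERS: Tuple[str, ...] = ("windows", "ubuntu")


def scope_check(question: str) -> Scope:
    """Classify a ticket question before spending any LLM budget on it."""
    text = question.lower()
    if any(marker in text for marker in JAILBREAK_MARKERS):
        return Scope.JAILBREAK
    if any(marker in text for marker in NON_RHEL_MARKERS):
        return Scope.NON_RHEL
    return Scope.IN_SCOPE
```

Only tickets classified as `Scope.IN_SCOPE` would continue to the Linux Expert and Solr Expert stages.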
Interactive Review:
- Human approval before code changes
- Two checkpoints: reasoning + diff
- Easy rollback with 'n'
- Git isolation (fix branch only, never auto-merge)
Dry-Run Mode:
- Preview Jira comments before posting
- Test integrations safely
- Review changes before applying
Every answer includes:
- Source URLs from official documentation
- Confidence scores (Solr + Review Agent)
- Reasoning for answers
- Evaluation metrics (URL F1, answer correctness, etc.)
- Git commits with full context
- Quick Start - One-page overview
- Demo Guide - Complete demo script with new features
- Bootstrap Guide - Detailed bootstrap workflow
- Demo Plan - 30-45 minute presentation plan
- Presentation Slides - Slide deck outline highlighting research approach
- Design Intent - Architecture overview
- OKP MCP Agent - Fix loop implementation
- Multi-Agent Extraction - Bootstrap details
- Pattern Discovery - Clustering approach
- Retrieval Optimization Findings - RAG research results
- RAG Integration Guide - How to use RAG findings
- RAG Extraction Testing - Testing RAG in bootstrap
- Comparison Summary - What was tested
- AGENTS.md - Guidelines for AI agents working on this codebase
- Testing - Test suite documentation
- Coverage - Coverage tracking
We welcome contributions! Please see:
- Code style: black, ruff, pylint
- Tests: pytest with pytest-mock
- Quality checks: make pre-commit
- Documentation: Update relevant .md files
HEAL's architecture is product-agnostic. Adapt it to:
| Component | RHEL | Other Products |
|---|---|---|
| Expert Agent | Linux Expert | Swap → Product Expert |
| Search Backend | Solr (OKP) | Any doc search API |
| Review Guidelines | RHEL-specific | Configure in YAML |
| Pattern Discovery | Domain-independent | No changes needed |
Potential applications:
- OpenShift documentation
- Kubernetes knowledge bases
- Enterprise software support
- Medical/legal information systems
- Any domain with authoritative documentation
- Multi-agent extraction with autonomous quality loop
- URL validation before synthesis
- Interactive review with two approval checkpoints
- Pattern discovery and clustering
- Jira integration with dry-run preview
- PR creation automation
- Git safety (fix branches, no auto-merge)
- Retrieval optimization research (RAG vs baseline comparison)
- RAG-enhanced Solr Expert (87.4% content relevance)
- Comparison framework for benchmarking strategies
- Data-driven pivot discovery (Expected URLs Problem)
- RAG extraction validation (testing if 87.4% content relevance → better answers)
- A/B testing of judge reasoning impact
- Correlation analysis (content relevance vs answer quality)
- Search intelligence analytics
- Pattern database integration
- Multi-repository PR coordination
- Post-merge Jira automation
- Cost tracking and optimization
- Model escalation for hard problems
- Self-healing with pattern database
[License information]
Built with:
- Claude (Sonnet 4.5, Opus 4) via Vertex AI
- lightspeed-evaluation - Evaluation framework
- claude-agent-sdk - Multi-agent orchestration
- GitHub Issues: [Coming Soon]
- Discussions: [Coming Soon]
- Email: [Contact information]
HEAL: Transforming RAG debugging from manual, error-prone work into autonomous, validated, scalable automation, with research-driven optimization and human oversight at critical points.
"We don't just build; we measure, learn, and pivot based on data."
Status: Production Deployed | Version: 2.0 | Last Updated: April 2026