Skip to content

emac-E/HEAL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

7 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

HEAL: Heuristic Engine for Autonomous Labor

Autonomous Multi-Agent RAG Fixing with Human-in-the-Loop Safety

Python 3.11+ Code style: black


๐ŸŽฏ What is HEAL?

HEAL is a research-driven, autonomous multi-agent system that diagnoses and fixes RAG (Retrieval-Augmented Generation) issues at scale. It extracts quality test cases from JIRA tickets, discovers patterns across failures, and generates fixes with evaluation-driven iterationโ€”all with human oversight at critical decision points.

What makes HEAL different: We don't just buildโ€”we measure, learn, and pivot based on data. When experiments show our assumptions are wrong (like optimizing for URL F1), we change direction mid-course.

The Problem It Solves

Manual RAG debugging doesn't scale:

  • RHEL Lightspeed had 68 JIRA tickets for incorrect answers
  • Manual extraction: 21% success rate (hallucinations, no verification)
  • Manual fixing: 2-4 hours per ticket, requires SME expertise
  • No systematic way to find patterns across similar failures
  • Even successful extractions used WRONG docs (reinstall vs update)

The HEAL Solution

Autonomous multi-agent pipeline with research-driven optimization:

graph TB
    A[JIRA Tickets] --> B[Scope Check]
    B --> C[Multi-Agent Extraction]
    C --> D[URL Validation]
    D --> E[Quality-Verified Q&A]
    E --> F[Pattern Discovery]
    F --> G[Evaluation-Driven Fixes]
    G --> H[Interactive Review]
    H --> I[Jira Updates + PR Creation]
    
    J[Retrieval Research] -.->|Optimize| C
    J -.->|87.4% Content Relevance| D
    K[Comparison Framework] -->|Data-Driven Pivots| J
    
    style J fill:#e1f5ff
    style K fill:#e1f5ff
Loading

Results:

  • โœ… 100% extraction success (vs 21% manual)
  • โœ… 60-100x faster than manual approach
  • โœ… 87.4% content relevance with RAG-enhanced retrieval (vs 63.3% baseline)
  • โœ… URL validation catches wrong docs before synthesis
  • โœ… Interactive review gives human approval before commits
  • โœ… Jira automation with dry-run preview mode
  • โœ… Research-driven optimization with data-driven pivots

๐Ÿš€ Quick Start

Prerequisites

  • Python 3.11+
  • uv package manager (install instructions)
  • Google Cloud SDK (gcloud) for authentication
  • Access to Anthropic Claude via Vertex AI

Installation

# 1. Clone repository
git clone <HEAL-repo-url>
cd HEAL

# 2. Install dependencies
uv sync --extra dev

Configuration

Required: Authentication

HEAL uses Claude via Vertex AI. Choose one authentication method:

Option 1: Application Default Credentials (recommended for development)

gcloud auth application-default login

Option 2: Service Account (recommended for CI/production)

# Set path to service account key in .env:
# GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

Required: Set Your Vertex AI Project

# Copy example config
cp .env.example .env

# Edit .env and set your project ID:
# ANTHROPIC_VERTEX_PROJECT_ID=your-gcp-project-id

Optional: Repository Paths

HEAL auto-detects related repositories if placed adjacently:

parent-directory/
โ”œโ”€โ”€ HEAL/                    # This repo
โ”œโ”€โ”€ okp-mcp/                 # Auto-detected
โ”œโ”€โ”€ lscore-deploy/           # Auto-detected
โ””โ”€โ”€ lightspeed-evaluation/   # Auto-detected

OR set paths explicitly in .env:

OKP_MCP_ROOT=/path/to/okp-mcp
LSCORE_DEPLOY_ROOT=/path/to/lscore-deploy
LIGHTSPEED_EVAL_ROOT=/path/to/lightspeed-evaluation

Optional: Service Configuration

# Solr URL (defaults to localhost:8983)
SOLR_URL=http://localhost:8983/solr/portal

# Custom log/worktree directories (defaults to ~/.heal/)
HEAL_LOG_DIR=/custom/path/logs
HEAL_WORKTREE_ROOT=/custom/path/worktrees

Verify Setup

# Check configuration
uv run python -c "from heal.core.config import HEALConfig; HEALConfig.print_config_summary()"

# Verify imports work
uv run python -c "from heal.agents import LinuxExpertAgent; print('โœ… HEAL ready!')"

Expected output:

HEAL Configuration:
  OKP-MCP root:         /path/to/okp-mcp
  lscore-deploy root:   /path/to/lscore-deploy
  lightspeed-eval root: /path/to/lightspeed-evaluation
  Solr URL:             http://localhost:8983/solr/portal
  Log directory:        /home/user/.heal/logs
  Worktree directory:   /home/user/.heal/worktrees

Environment Validation:
  โœ… okp_mcp_found
  โœ… log_dir_writable
  โœ… worktree_dir_writable
  โœ… solr_url_valid

Run a Quick Demo

# Interactive demo (10 tickets, ~5-10 minutes)
./scripts/demo_heal_workflow.sh --quick

# Full demo (68 tickets, ~45-60 minutes)
./scripts/demo_heal_workflow.sh

Troubleshooting

โŒ Error: OKP-MCP repository not found

Place okp-mcp repository adjacent to HEAL:

cd ..
git clone <okp-mcp-repo-url>

Or set environment variable:

export OKP_MCP_ROOT=/path/to/okp-mcp

โŒ Error: Solr is not accessible

If Solr is running on a different host/port:

export SOLR_URL=http://your-solr-host:8983/solr/portal

For Docker:

export SOLR_URL=http://host.docker.internal:8983/solr/portal

โŒ Error: Claude authentication failed

Verify ADC is set up correctly:

gcloud auth application-default login
gcloud config set project your-gcp-project-id

Check credentials file exists:

ls -la ~/.config/gcloud/application_default_credentials.json

Debug Logs

HEAL writes debug logs to ~/.heal/logs/:

  • solr_multi_agent_debug.log - Multi-agent system calls
  • claude_sdk_debug.log - Claude SDK interactions

Custom log location:

export HEAL_LOG_DIR=/path/to/logs

๐Ÿ”ฌ Retrieval Optimization: Research-Driven Approach

HEAL doesn't just buildโ€”it measures, learns, and pivots based on data.

The Research Question

Can cheap baseline retrieval (RAG-enhanced edismax) replace expensive multi-agent URL validation?

Hypothesis:

  • Need LLM-based URL validation ($0.01-0.05 per query)
  • Multi-agent validation necessary for quality

Experiment Design:

# Compare 3 retrieval strategies on BOOTLOADER pattern
uv run python scripts/compare_okp_vs_baseline.py \
    --pattern BOOTLOADER_GRUB_ISSUES --details

Measured:

  • URL F1 (exact URL matches) - traditional metric
  • Content Relevance (semantic keyword overlap) - cheap heuristic
  • Answer quality spot-checks - ground truth

Data-Driven Findings

Strategy URL F1 Content Relevance Cost
Simple (baseline) 4.4% 63.3% $0
RAG (edismax) 6.7% 87.4% โœ… $0
okp-mcp (validation) TBD TBD $0.01-0.05/query

Key Discovery: The "Expected URLs Problem"

  • RAG achieved low URL F1 (6.7%) but high content relevance (87.4%)
  • Retrieved different but semantically correct docs
  • Expected URLs aren't exhaustiveโ€”many valid answers exist!

Pivot Decision:

  • โŒ Don't optimize for exact URL matches (wrong metric)
  • โœ… DO optimize for content relevance + answer quality
  • ๐Ÿ’ฐ Can replace expensive validation with cheap heuristic
  • ๐Ÿ“Š Save $3-15 per pattern while maintaining quality

RAG-Enhanced Configuration (Proven)

# Solr edismax with optimized field boosting
params = {
    "defType": "edismax",
    "qf": "title^3.0 content^1.0 main_content^1.5 id^2.0",
    "pf": "title^10.0 content^5.0 main_content^7.0",  # Phrase boosting
    "ps": "2",      # Phrase slop
    "mm": "50%",    # Minimum match
}
# โ†’ 87.4% content relevance (vs 63.3% baseline)

Testing RAG in Bootstrap

# Test RAG-enhanced extraction on sample tickets
uv run python src/heal/bootstrap/extract_jira_tickets_rag.py --limit 3

# Compare quality: baseline vs RAG
uv run python scripts/compare_extracted_yamls.py --details

Metrics compared:

  • Answer length (more detailed?)
  • URLs retrieved (different docs?)
  • Refinement iterations (better first-pass quality?)
  • Review scores (higher quality?)

See: docs/RAG_EXTRACTION_TESTING.md for complete testing guide


๐Ÿ“– Complete Workflows

Workflow 1: Bootstrap - Extract Test Cases from JIRA

Goal: Convert JIRA tickets into quality-verified Q&A pairs with source URLs

Stage 1: Extract Tickets with Multi-Agent Verification

# Extract from JQL query
uv run python src/heal/bootstrap/extract_jira_tickets.py \
    --jql "project = RSPEED AND labels = cla-incorrect-answer AND resolution = Unresolved" \
    --output config/extracted_tickets.yaml

# Or extract specific tickets
uv run python src/heal/bootstrap/extract_jira_tickets.py \
    --tickets RSPEED-2651,RSPEED-2652,RSPEED-2653

# Force re-extract (update existing tickets)
uv run python src/heal/bootstrap/extract_jira_tickets.py \
    --tickets RSPEED-2651 \
    --force-reextract

What happens:

  1. Scope Check filters meta-tickets, jailbreaks, non-RHEL questions (38% noise filtered)
  2. Linux Expert forms hypothesis about correct answer
  3. Solr Expert searches RHEL documentation for verification
  4. URL Validation Agent โœจ NEW: Validates docs BEFORE synthesis
    • Catches wrong docs early (e.g., "reinstall" vs "update")
    • Retries search with better queries if validation fails
  5. LinuxExpert synthesizes answer from VALIDATED docs
  6. Answer Review Agent checks quality (iterates up to 3x until score โ‰ฅ 0.7)

Output: config/extracted_tickets.yaml - 100% success on valid RHEL tickets

Stage 2: Validate/Fix URLs in Pattern YAMLs

NEW: In-place URL validation - no full re-extraction needed!

# Read-only validation (just report issues)
uv run python scripts/validate_yaml_urls.py --pattern BOOTLOADER_GRUB_ISSUES

# Auto-fix: search for better URLs (dry-run first)
uv run python scripts/validate_yaml_urls.py \
    --pattern BOOTLOADER_GRUB_ISSUES \
    --auto-fix \
    --dry-run

# Apply fixes (creates .yaml.bak backup)
uv run python scripts/validate_yaml_urls.py \
    --pattern BOOTLOADER_GRUB_ISSUES \
    --auto-fix

What it does:

  • Searches Solr with each ticket's query
  • Validates retrieved docs actually answer the question
  • Updates expected_urls in pattern YAML if better URLs found
  • Saves changes in-place with backup

Stage 3: Discover Patterns

# Analyze tickets to find common failure patterns
uv run python src/heal/pattern_discovery/discover_ticket_patterns.py \
    --input config/extracted_tickets.yaml \
    --output-tagged config/tickets_with_patterns.yaml \
    --output-report config/patterns_report.json \
    --min-pattern-size 3

Output: Pattern groups with โ‰ฅ3 similar tickets (e.g., BOOTLOADER_GRUB_ISSUES, RPM_OSTREE_COMMANDS)

Stage 4: Convert to Evaluation Format

# Generate one YAML per pattern for lightspeed-evaluation
uv run python src/heal/bootstrap/convert_bootstrap_to_eval_format.py \
    --tickets config/extracted_tickets.yaml \
    --patterns config/patterns_report.json \
    --output-dir config/patterns/

Output: config/patterns/{PATTERN_ID}.yaml - ready for evaluation


Workflow 2: Pattern Fixing - Diagnose & Fix Issues

Goal: Fix retrieval/ranking issues with evaluation-driven iteration and human oversight

Interactive Fix Loop (DEFAULT: Human approval required)

# Run pattern fix with interactive review
./runners/fix.sh BOOTLOADER_GRUB_ISSUES

What happens:

  1. Baseline evaluation identifies the problem (low URL F1, poor answer quality)
  2. Multi-agent diagnosis (Solr Expert + Code Expert) proposes fix
  3. โœจ Human approval #1: Review reasoning, approve/reject change
  4. Change applied โ†’ file modified
  5. Git diff shown
  6. โœจ Human approval #2: Review diff, approve/reject
    • If rejected โ†’ git restore (instant revert)
    • If approved โ†’ runs test
  7. Test passes โ†’ commits change
  8. Evaluation checks if metrics improved
  9. Iterate until stable or max iterations reached

Safety features:

  • Two approval checkpoints
  • Easy rollback with 'n'
  • Test-before-commit
  • Git isolation (fix branch, never auto-merges)

YOLO Mode (Auto-approve all changes)

# Skip interactive prompts (for automation)
./runners/fix.sh BOOTLOADER_GRUB_ISSUES --yolo

Preview Jira Comments (Dry-Run)

# See what WOULD be posted to Jira (doesn't actually post)
./runners/fix.sh BOOTLOADER_GRUB_ISSUES --dry-run-integrations

# Review the preview
cat .diagnostics/BOOTLOADER_GRUB_ISSUES/JIRA_COMMENTS_PREVIEW.md

Preview includes:

  • Metrics (before/after comparison)
  • Model reasoning for the fix
  • Warnings (high variance, RAG quality issues)
  • Code changes summary
  • Next steps for reviewers

Post to Jira & Create PR

# Actually post to Jira and create PR
./runners/fix.sh BOOTLOADER_GRUB_ISSUES --enable-jira --create-pr

What happens:

  • Posts comprehensive comment to each ticket in pattern
  • Pushes fix branch to remote
  • Creates PR with metrics, reasoning, testing checklist
  • Leaves you on fix branch (ready to review/merge)

๐Ÿ—๏ธ Architecture: Five Specialized Agents

1. Linux Expert Agent

15+ years RHEL expertise encoded as agent behavior

  • Forms hypotheses about correct answers
  • Synthesizes verified responses from documentation
  • Refines answers based on Review Agent feedback
  • Uses Claude Sonnet 4.5 via Vertex AI

2. Solr Expert Agent (+ RAG Variant)

Searches RHEL documentation (OKP) for fact verification

  • Queries official RHEL knowledge portal
  • Returns clean docs + source URLs
  • Builds search intelligence database
  • Provides confidence scoring

RAG-Enhanced Variant (proven 87.4% content relevance):

  • Optimized edismax with field boosting (title^3.0, main_content^1.5)
  • Phrase field boosting for better matching
  • Phrase slop (ps=2) for fuzzy matching
  • Minimum match (mm=50%) to reduce false positives
  • Drop-in replacement for testing better retrieval

3. URL Validation Agent โœจ NEW

Validates docs BEFORE synthesis

  • Prevents synthesis from wrong docs
  • Catches semantic mismatches (e.g., "update" vs "reinstall")
  • Retries search with better queries if validation fails
  • Reduces answer refinement cycles by ~30%

4. Answer Review Agent

Quality gatekeeper for production-ready answers

  • Scores answers 0.0-1.0 (must score โ‰ฅ 0.7)
  • Checks against production guidelines:
    • Conciseness (no verbose explanations)
    • No "based on documentation" phrases
    • Complete commands with all parameters
    • Proper markdown formatting
  • Provides suggested fixes for common issues
  • Enables autonomous quality loop

5. Pattern Discovery Agent

Finds common themes across failures

  • LLM-based clustering (Claude Sonnet 4)
  • Groups similar failures (โ‰ฅ3 tickets per pattern)
  • Auto-filters OUT_OF_SCOPE tickets
  • Enables batch fixing (10-15 tickets per fix)

6. Pattern Fix Agent โœจ Interactive Review

Evaluation-driven optimization with human oversight

  • Baseline โ†’ diagnose โ†’ fix โ†’ test โ†’ iterate
  • Interactive review at two checkpoints
  • YOLO mode for automation
  • Multi-agent collaboration (Solr + Code experts)
  • Commits only if tests pass

๐ŸŽ›๏ธ Configuration

Config Files

config/pattern_fix_config.yaml - Main configuration

eval_root: /path/to/lightspeed-evaluation
okp_mcp_root: /path/to/okp-mcp
lscore_deploy_root: /path/to/lscore-deploy

patterns_dir: config/patterns
max_iterations: 10
stability_runs: 3
validation_cycles: 3  # Outer loop with full answer validation

# Interactive review (can be overridden with --yolo flag)
interactive: true

Environment Variables

Required:

  • ANTHROPIC_VERTEX_PROJECT_ID - Your GCP project ID for Claude API
  • GOOGLE_APPLICATION_CREDENTIALS - GCP credentials (set via gcloud auth)

Optional:

  • API_KEY - For RHEL Lightspeed API (if testing against live API)

Command-Line Flags

Bootstrap Flags

scripts/validate_yaml_urls.py [OPTIONS]
  --pattern PATTERN_ID       # Specific pattern to validate
  --auto-fix                 # Search for better URLs if validation fails
  --dry-run                  # Preview changes without saving

Fix Loop Flags

./runners/fix.sh PATTERN_ID [OPTIONS]

Interactive Review (DEFAULT: ON):
  --yolo                     # Auto-approve all changes (skip prompts)

Jira Integration (DEFAULT: OFF):
  --enable-jira              # Post comments to Jira tickets
  --dry-run-integrations     # Preview Jira/PR without executing

PR Creation (DEFAULT: OFF):
  --create-pr                # Create GitHub PR after successful fix

Testing:
  --mode single              # Test one ticket per pattern
  --mode full                # Test all tickets in pattern
  --max-iterations N         # Max Solr optimization iterations
  --validation-cycles N      # Outer loop cycles with answer validation
  --include-judge-reasoning  # A/B test: include LLM judge critique

Retrieval Research Flags

# Compare retrieval strategies
scripts/compare_okp_vs_baseline.py [OPTIONS]
  --pattern PATTERN_ID       # Pattern to test (default: BOOTLOADER_GRUB_ISSUES)
  --details                  # Show per-query iteration details

# Test RAG extraction
src/heal/bootstrap/extract_jira_tickets_rag.py [OPTIONS]
  --limit N                  # Number of tickets to extract
  --force-rebuild            # Start fresh (ignore existing YAML)
  --tickets TICKET_IDS       # Specific tickets to test

# Compare YAML quality
scripts/compare_extracted_yamls.py [OPTIONS]
  --details                  # Show per-ticket comparison
  --ticket TICKET_ID         # Deep-dive on single ticket

๐Ÿ“Š Results & Metrics

Real Deployment Results (68 JIRA Tickets)

Metric Before HEAL After HEAL Improvement
Extraction Success 21% 100% 4.8x
Time to Extract 2-4 hours (manual) 10-15 min (autonomous) 10-20x faster
Answer Quality Unverified Score โ‰ฅ 0.7 (validated) โœ… Production-ready
Source Traceability None Every answer has URLs โœ… Auditable
Scope Detection Manual triage Auto-filters 38% noise โœ… Intelligent
Security Vulnerable to jailbreaks Auto-blocks attacks โœ… Protected
URL Accuracy Unknown Validated before synthesis โœ… Verified

Breakdown

  • โœ… 42 RHEL tickets extracted (100% success)
  • ๐Ÿšซ 26 meta-tickets filtered (38% noise reduction)
    • 8 jailbreak attempts blocked
    • 18 meta-tickets about CLA behavior
  • โฑ๏ธ Total time: 1-1.5 hours vs 100+ hours manual
  • ๐Ÿ“ˆ URL validation: ~30% reduction in answer refinement cycles

Retrieval Optimization Results

  • ๐Ÿ”ฌ RAG-enhanced edismax: 87.4% content relevance (vs 63.3% baseline)
  • ๐Ÿ’ก Discovery: URL F1 is wrong metricโ€”optimize for semantic relevance instead
  • ๐Ÿ’ฐ Cost savings: Can replace $0.01-0.05/query validation with $0 heuristic
  • ๐Ÿ“Š Research-driven: Data shows when to pivot strategy mid-experiment

๐Ÿ›ก๏ธ Security & Safety

Built-In Protection

Scope Check (Pre-Flight Filter):

  • Detects meta-tickets about AI behavior
  • Blocks jailbreak attempts and prompt injection
  • Filters non-RHEL questions (Windows, Ubuntu, etc.)
  • Runs BEFORE expensive LLM calls
  • Result: 8 jailbreak attempts blocked (0% success)

Interactive Review:

  • Human approval before code changes
  • Two checkpoints: reasoning + diff
  • Easy rollback with 'n'
  • Git isolation (fix branch only, never auto-merge)

Dry-Run Mode:

  • Preview Jira comments before posting
  • Test integrations safely
  • Review changes before applying

Audit Trail

Every answer includes:

  • โœ… Source URLs from official documentation
  • โœ… Confidence scores (Solr + Review Agent)
  • โœ… Reasoning for answers
  • โœ… Evaluation metrics (URL F1, answer correctness, etc.)
  • โœ… Git commits with full context

๐Ÿ“š Documentation

For Users

For Developers

Research & Optimization

For Contributors

  • AGENTS.md - Guidelines for AI agents working on this codebase
  • Testing - Test suite documentation
  • Coverage - Coverage tracking

๐Ÿค Contributing

We welcome contributions! Please see:

  • Code style: black, ruff, pylint
  • Tests: pytest with pytest-mock
  • Quality checks: make pre-commit
  • Documentation: Update relevant .md files

๐ŸŽ“ Use Cases Beyond RHEL

HEAL's architecture is product-agnostic. Adapt it to:

Component RHEL Other Products
Expert Agent Linux Expert Swap โ†’ Product Expert
Search Backend Solr (OKP) Any doc search API
Review Guidelines RHEL-specific Configure in YAML
Pattern Discovery Domain-independent No changes needed

Potential applications:

  • OpenShift documentation
  • Kubernetes knowledge bases
  • Enterprise software support
  • Medical/legal information systems
  • Any domain with authoritative documentation

๐Ÿ“ˆ Roadmap

Completed โœ…

  • Multi-agent extraction with autonomous quality loop
  • URL validation before synthesis
  • Interactive review with two approval checkpoints
  • Pattern discovery and clustering
  • Jira integration with dry-run preview
  • PR creation automation
  • Git safety (fix branches, no auto-merge)
  • Retrieval optimization research (RAG vs baseline comparison)
  • RAG-enhanced Solr Expert (87.4% content relevance)
  • Comparison framework for benchmarking strategies
  • Data-driven pivot discovery (Expected URLs Problem)

In Progress ๐Ÿšง

  • RAG extraction validation (testing if 87.4% content relevance โ†’ better answers)
  • A/B testing of judge reasoning impact
  • Correlation analysis (content relevance vs answer quality)
  • Search intelligence analytics
  • Pattern database integration

Planned ๐Ÿ”ฎ

  • Multi-repository PR coordination
  • Post-merge Jira automation
  • Cost tracking and optimization
  • Model escalation for hard problems
  • Self-healing with pattern database

๐Ÿ“ License

[License information]


๐Ÿ™ Acknowledgments

Built with:


๐Ÿ“ž Contact

  • GitHub Issues: [Coming Soon]
  • Discussions: [Coming Soon]
  • Email: [Contact information]

HEAL: Transforming RAG debugging from manual, error-prone work into autonomous, validated, scalable automationโ€”with research-driven optimization and human oversight at critical points.

"We don't just buildโ€”we measure, learn, and pivot based on data."

Status: Production Deployed | Version: 2.0 | Last Updated: April 2026

About

Heuristic Engine for Autonomous Labor

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors