TrainLoop Python Examples

A quick guide to running the LLM evaluation examples.

Setup

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Create a .env file from the template, then fill in your API keys
cp .env.example .env
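
After copying the template, it can help to confirm that no key was left blank. The following is a minimal sketch using only the standard library; the helper names `parse_env` and `missing_keys` are hypothetical, and it assumes both files use plain `KEY=VALUE` lines. It cannot detect placeholder values that were copied over unchanged.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(example_text: str, env_text: str) -> list[str]:
    """Return keys named in the template that are absent or empty in .env."""
    expected = parse_env(example_text)
    actual = parse_env(env_text)
    return [key for key in expected if not actual.get(key)]


if __name__ == "__main__":
    example, env = Path(".env.example"), Path(".env")
    if example.exists() and env.exists():
        print(missing_keys(example.read_text(), env.read_text()) or "All keys are set.")
```

Run it from the examples directory; an empty result means every key in the template has a value in `.env`.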

Run Examples

# Code generation example (evaluates if LLM can write valid code)
python writes_valid_code.py

# Letter counting example (evaluates counting accuracy)
python counter_agent.py

# Customer support tone example (evaluates polite responses)
python polite_responder.py

# Active voice transformation example (evaluates style rewriting)
python active_voice_rewriter.py

# Run each script 3-4 times to collect samples
# Check collected data in trainloop/data/events/
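
Repeating each script by hand gets tedious, so the runs can be scripted. This is a sketch under stated assumptions: it is executed from the examples directory with the virtual environment active, and the helper names `build_commands` and `run_all` are hypothetical rather than part of TrainLoop.

```python
import subprocess
import sys

# The four example scripts from this directory.
SCRIPTS = [
    "writes_valid_code.py",
    "counter_agent.py",
    "polite_responder.py",
    "active_voice_rewriter.py",
]


def build_commands(scripts, runs=3):
    """Return one command per script per run, using the current interpreter."""
    return [[sys.executable, script] for script in scripts for _ in range(runs)]


def run_all(scripts=SCRIPTS, runs=3):
    """Run every script `runs` times and stop at the first failure."""
    for command in build_commands(scripts, runs):
        subprocess.run(command, check=True)
```

Calling `run_all()` runs each script three times; pass `runs=4` to collect one more sample per script.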

Evaluate Results

# Install TrainLoop CLI globally (recommended)
pipx install trainloop-cli

# Or install from a local checkout of the repository into the active virtual environment
pip install -e ../../cli

# Check that it installed correctly
trainloop --version

# Run evaluation
cd trainloop
trainloop eval

Benchmark Models

# Compare different models
trainloop benchmark

What's Being Evaluated

Code Generation (`writes_valid_code.py`)

Tests how reliably models can generate valid, executable code.

Prompts for a recursive factorial function
Measures: syntax correctness, function behavior, error handling
Most modern LLMs score 100% on this task
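
To illustrate what syntax and behavior checks of this kind can look like, here is a minimal sketch. It is not TrainLoop's actual metric code; the function name `factorial` and the helpers `is_valid_python` and `factorial_behaves` are assumptions made for the example.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python without a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def factorial_behaves(source: str, name: str = "factorial") -> bool:
    """Execute the source and spot-check the named function on known values."""
    namespace = {}
    try:
        # Caution: only execute generated code in an environment you trust.
        exec(source, namespace)
        func = namespace[name]
        return all(func(n) == expected for n, expected in [(0, 1), (1, 1), (5, 120)])
    except Exception:
        return False


# A hand-written stand-in for a model response.
SAMPLE = '''
def factorial(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1 if n <= 1 else n * factorial(n - 1)
'''
```

Here `is_valid_python(SAMPLE)` and `factorial_behaves(SAMPLE)` both return `True`, while a response with a syntax error or a wrong base case fails one of the two checks.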

Letter Counting (`counter_agent.py`)

Tests a basic counting task that humans find trivial but LLMs often get wrong.

Prompts to count each letter in "strawberry"
Measures: format compliance, counting accuracy
Common failure: reporting 2 occurrences of 'r' instead of 3, usually attributed to tokenization (the model sees subword tokens rather than individual letters)
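
The ground truth for this task is easy to compute, which makes the accuracy check straightforward. The sketch below assumes the model's answer has already been parsed into a dictionary of letter counts; `letter_counts` and `score_counts` are hypothetical helper names, not TrainLoop functions.

```python
from collections import Counter


def letter_counts(word: str) -> dict[str, int]:
    """Return the true count of each letter in the word."""
    return dict(Counter(word.lower()))


def score_counts(word: str, predicted: dict[str, int]) -> float:
    """Return the fraction of distinct letters whose predicted count is correct."""
    expected = Counter(word.lower())
    correct = sum(1 for letter, n in expected.items() if predicted.get(letter) == n)
    return correct / len(expected)


truth = letter_counts("strawberry")
# Simulate the common failure: 'r' reported as 2 instead of 3.
common_mistake = {**truth, "r": 2}
```

With these definitions, `truth["r"]` is 3, and `score_counts("strawberry", common_mistake)` is 0.875 because seven of the eight distinct letters are counted correctly.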

Customer Support Tone (`polite_responder.py`)

Tests whether models can produce polite, empathetic customer service responses.

Prompts for a response to an angry customer complaint
Measures: politeness/apology, word count limit (≤120 words)
Evaluates tone, empathy, and solution-oriented approach
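
The two measurable criteria, an apology and the 120-word limit, can be approximated with simple checks. This is a rough sketch, not the metrics shipped with the example: the keyword list and the helper names are assumptions, and tone or empathy would need a stronger judge than a regular expression.

```python
import re

# Assumed apology keywords; a real metric may use a different list or an LLM judge.
APOLOGY_PATTERN = re.compile(r"\b(sorry|apologi[sz]e|apologies)\b", re.IGNORECASE)
MAX_WORDS = 120


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def within_word_limit(text: str, limit: int = MAX_WORDS) -> bool:
    """Return True if the response has at most `limit` words."""
    return word_count(text) <= limit


def contains_apology(text: str) -> bool:
    """Return True if the response contains an apology keyword."""
    return bool(APOLOGY_PATTERN.search(text))
```

A response passes this sketch when both `contains_apology(reply)` and `within_word_limit(reply)` return `True`.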

Active Voice Transformation (`active_voice_rewriter.py`)

Tests simple style transformation from passive to active voice.

Prompts to rewrite a passive sentence in active voice
Measures: successful voice transformation while preserving meaning
Evaluates basic writing style adaptation
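
A crude way to check the voice transformation is to look for a passive construction in the output. The heuristic below is only a sketch under loud assumptions: it matches a form of "to be" followed by a word ending in "ed" or "en", so it misses irregular participles such as "thrown" or "made", can misfire on words like "often", and does not check that the meaning is preserved. The name `looks_passive` is hypothetical.

```python
import re

# A form of "to be", an optional -ly adverb, then a word ending in -ed or -en.
PASSIVE_PATTERN = re.compile(
    r"\b(?:am|is|are|was|were|be|been|being)\s+(?:\w+ly\s+)?\w+(?:ed|en)\b",
    re.IGNORECASE,
)


def looks_passive(sentence: str) -> bool:
    """Return True if the sentence matches the rough passive-voice pattern."""
    return bool(PASSIVE_PATTERN.search(sentence))
```

For example, `looks_passive("The report was written by the team.")` returns `True`, while `looks_passive("The team wrote the report.")` returns `False`.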

Evaluation results are saved in trainloop/data/results/.

View Results in Studio

# Launch the interactive UI to explore results
trainloop studio

This opens a web interface to visualize the events, results, and benchmarks.
