AutomationBench AI benchmark leaderboard

Can AI models do real work? Zapier's AI benchmark assessment measures execution in realistic systems.

AutomationBench tests AI agents on end-to-end workflow execution using 47 real tools across six business functions: Sales, Marketing, Operations, Support, Finance, and HR. It's built on real patterns from 2B+ monthly tasks across 3.7M companies.

View on GitHub

Download the whitepaper

Realistic environments

Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.

Deterministic scoring

We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.
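As a rough illustration of state-based scoring, the sketch below checks a final environment state against a fixed list of criteria. Everything in it is an assumption made for illustration: the `Criterion` type, the `task_passed` and `benchmark_score` helpers, the state layout, and the all-criteria-must-hold pass rule are hypothetical and are not taken from the AutomationBench code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation of an environment snapshot (not the real schema).
State = Dict[str, Any]


@dataclass(frozen=True)
class Criterion:
    """One fixed, named check applied to the final environment state."""
    name: str
    check: Callable[[State], bool]


def task_passed(final_state: State, criteria: List[Criterion]) -> bool:
    # Assumed rule: a task passes only if every criterion holds.
    return all(c.check(final_state) for c in criteria)


def benchmark_score(outcomes: List[bool]) -> float:
    # Percentage of tasks passed; 0.0 for an empty run.
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical criteria for a sales task: close the deal and email the buyer.
CRITERIA = [
    Criterion(
        "deal_marked_won",
        lambda s: s["crm"]["deal-1"]["stage"] == "closed_won",
    ),
    Criterion(
        "confirmation_sent",
        lambda s: any(m["to"] == "buyer@example.com" for m in s["sent_mail"]),
    ),
]

good_state = {
    "crm": {"deal-1": {"stage": "closed_won"}},
    "sent_mail": [{"to": "buyer@example.com", "subject": "Confirmed"}],
}
bad_state = {
    "crm": {"deal-1": {"stage": "closed_won"}},
    "sent_mail": [],  # the agent updated the CRM but never sent the email
}

print(task_passed(good_state, CRITERIA))
print(task_passed(bad_state, CRITERIA))
```

Because the checks are plain predicates over the final state, the same run always gets the same score, which is the property the "no LLM-as-judge" claim depends on.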

Leaderboard

| Rank | Model | Score | Cost / task |
|---|---|---|---|
| 1 | Fable 5.0 (Max) | 17.4% | $3.67 |
| 2 | Fable 5.0 (XHigh) | 16.0% | $3.03 |
| 3 | Claude Opus 4.8 (XHigh) | 15.5% | $2.36 |
| 4 | Claude Opus 4.8 (Max) | 15.4% | $3.14 |
| 5 | Gemini 3.5 Flash (Medium) | 14.5% | $0.87 |
| 6 | Fable 5.0 (Medium) | 12.9% | $2.59 |
| 7 | GPT-5.5 (XHigh) | 12.9% | $6.31 |
| 8 | Gemini 3.5 Flash (High) | 12.6% | $1.30 |
| 9 | Fable 5.0 (High) | 12.6% | $2.70 |
| 10 | Gemini 3.5 Flash (Low) | 12.2% | $0.65 |
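The leaderboard lists score and cost per task separately. One way to read them together is score points per dollar, sketched below. This metric is not part of AutomationBench; it is a hypothetical convenience, and the `points_per_dollar` helper is an illustrative name. The figures are copied from the leaderboard above.

```python
# (model, score in percent, cost per task in USD), copied from the leaderboard.
LEADERBOARD = [
    ("Fable 5.0 (Max)", 17.4, 3.67),
    ("Fable 5.0 (XHigh)", 16.0, 3.03),
    ("Claude Opus 4.8 (XHigh)", 15.5, 2.36),
    ("Claude Opus 4.8 (Max)", 15.4, 3.14),
    ("Gemini 3.5 Flash (Medium)", 14.5, 0.87),
    ("Fable 5.0 (Medium)", 12.9, 2.59),
    ("GPT-5.5 (XHigh)", 12.9, 6.31),
    ("Gemini 3.5 Flash (High)", 12.6, 1.30),
    ("Fable 5.0 (High)", 12.6, 2.70),
    ("Gemini 3.5 Flash (Low)", 12.2, 0.65),
]


def points_per_dollar(score_pct: float, cost_usd: float) -> float:
    # Hypothetical efficiency metric: percentage points of score per dollar per task.
    return score_pct / cost_usd


# Sort from most to least cost-efficient under this metric.
by_efficiency = sorted(
    LEADERBOARD,
    key=lambda row: points_per_dollar(row[1], row[2]),
    reverse=True,
)

for model, score, cost in by_efficiency:
    print(f"{model:<28} {points_per_dollar(score, cost):6.2f} points per dollar")
```

Under this metric the three Gemini 3.5 Flash configurations come out ahead of every other entry, and GPT-5.5 (XHigh) comes last, even though the raw score ranking looks quite different.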

By domain

| Domain | Top model | Score | 2nd place model |
|---|---|---|---|
| Sales | GPT-5.5 (XHigh), OpenAI | 17.9% | Claude Opus 4.8 (Max), Anthropic (17.1%) |
| Marketing | GPT-5.5 (XHigh), OpenAI | 20.0% | Fable 5.0 (Max), Anthropic, and Claude Opus 4.7 (Max), Anthropic (tie at 18.0%) |
| Operations | Fable 5.0 (Max), Anthropic | 27.0% | Fable 5.0 (XHigh), Anthropic (23.0%) |
| Support | Claude Opus 4.8 (XHigh), Anthropic | 15.0% | Fable 5.0 (XHigh), Anthropic (14.0%) |
| Finance | Fable 5.0 (Max), Anthropic | 12.5% | Claude Opus 4.8 (Max), Anthropic (tie at 12.5%) |
| HR | Fable 5.0 (Max), Anthropic | 20.0% | Claude Opus 4.8 (XHigh), Anthropic (tie at 20.0%) |

Try the latest models in Zapier

Want to test frontier models on real workflows today? Use Zapier to run agents across the tools you already use. Zapier Benchmarks results will be published as evaluations complete.
