Even at 90% consistency, frontier models still contradict themselves at scale
Why the latest generation of models is not yet production-ready for high-stakes legal work
The latest generation of frontier models reaches the same conclusion on a legal question roughly 90% of the time. At scale, the remaining 10% still produces contradictory answers to the same question every single week.
We've developed proprietary benchmarks for legal reasoning, maintained by attorneys at Norm.
Norm Law attorneys serve as outside counsel to hedge funds, PE firms, and leading asset managers, and they deploy AI agents in their day-to-day work. That makes us uniquely positioned to build legal AI benchmarks grounded in real practice.
We've been tracking frontier models across generations, and the trend is clear: models are improving substantially in their legal reasoning capabilities. Models in the latest generation are nearly indistinguishable from one another in how accurately they answer legal questions.
Most models are increasingly consistent, with the most recent generation of frontier models reaching the same conclusion roughly 90% of the time.
But for high-stakes legal work, even the best models on this benchmark reach a different answer often enough that, at scale, users receive contradictory answers to the same question every week.
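To make the scale argument concrete, here is a minimal back-of-the-envelope sketch. It is not our benchmark methodology. It assumes each run independently lands on the majority answer with probability 0.90 (the rough figure above), and the weekly volumes are hypothetical numbers chosen only for illustration.

```python
# Back-of-the-envelope sketch, not the benchmark itself.
# Assumption: each run independently returns the majority answer
# with probability `consistency`. Real model runs may be correlated.

def expected_divergent(n_queries: int, consistency: float) -> float:
    """Expected number of runs that land on a non-majority answer."""
    return n_queries * (1.0 - consistency)


def prob_at_least_one_divergent(n_queries: int, consistency: float) -> float:
    """Probability that at least one of n runs contradicts the majority answer."""
    return 1.0 - consistency ** n_queries


consistency = 0.90  # rough figure quoted in the article

# Hypothetical weekly counts of how often the same question is asked.
for weekly_repeats in (10, 50, 500):
    exp = expected_divergent(weekly_repeats, consistency)
    prob = prob_at_least_one_divergent(weekly_repeats, consistency)
    print(f"{weekly_repeats:>4} repeats/week: "
          f"~{exp:.0f} divergent answers expected, "
          f"P(at least one) = {prob:.4f}")
```

Under these assumptions, a question asked only ten times in a week already has about a 65% chance of drawing at least one contradictory answer, and the chance approaches certainty as volume grows.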
To integrate AI agents into high-stakes legal workflows, you need both: (1) purpose-built systems that automatically constrain, verify, and govern AI reasoning, and (2) human expertise layered over live workflows and deliberately integrated with the AI agents.