Model swaps look like configuration changes, but they behave more like product migrations. The product question is harder: if you change only the model, does the system still behave the way users expect? We tested 7 model targets under the same agent harness: same tasks, same fixture repo, same tools, same evaluator setup. Only the model changed. In the harnessed sweep, correctness stayed relatively close: 79.6% to 85.1%. The models landed in a similar correctness band, but they did not behave the same operationally. Can you swap models safely? Yes, sometimes. But only when the eval shows the behavior still meets the product bar. Nancy Chauhan wrote up what changed when we tested 7 models under the same agent harness. https://lnkd.in/gq2_B2qf
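A minimal sketch of the comparison described in this post, assuming each run in the sweep is logged as one record per task. The `RunResult` fields, the `meets_bar` thresholds, and the model names are hypothetical illustrations, not the harness or the data from the write-up.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    passed: bool      # evaluator verdict for one task
    tool_calls: int   # operational signal: tool calls made during the run
    latency_s: float  # operational signal: wall-clock time for the run


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Aggregate correctness and operational metrics per model."""
    by_model: dict[str, list[RunResult]] = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)
    summary: dict[str, dict[str, float]] = {}
    for model, runs in by_model.items():
        n = len(runs)
        summary[model] = {
            "correctness": sum(r.passed for r in runs) / n,
            "avg_tool_calls": sum(r.tool_calls for r in runs) / n,
            "avg_latency_s": sum(r.latency_s for r in runs) / n,
        }
    return summary


def meets_bar(candidate: dict[str, float], baseline: dict[str, float],
              max_correctness_drop: float = 0.02,
              max_tool_call_ratio: float = 1.5) -> bool:
    """Hypothetical product bar: similar correctness AND similar operational cost."""
    correct_enough = candidate["correctness"] >= baseline["correctness"] - max_correctness_drop
    cheap_enough = candidate["avg_tool_calls"] <= baseline["avg_tool_calls"] * max_tool_call_ratio
    return correct_enough and cheap_enough


# Made-up records for illustration only: both models pass every task,
# but model_b needs far more tool calls to get there.
results = [
    RunResult("model_a", True, 4, 10.0),
    RunResult("model_a", True, 6, 14.0),
    RunResult("model_b", True, 12, 30.0),
    RunResult("model_b", True, 14, 34.0),
]
summary = summarize(results)
swap_ok = meets_bar(summary["model_b"], summary["model_a"])
```

The point of the two-part check is that a correctness score alone would call this swap safe, while the operational metrics show the candidate behaving differently.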
About us
The AI engineering platform for teams shipping reliable AI agents and LLM applications. Ship agents that work.
- Website
- http://www.arize.com
- Industry
- Software Development
- Company size
- 51-200 employees
- Headquarters
- San Francisco, CA
- Type
- Privately Held
Locations
- Primary: San Francisco, CA, US
Employees at Arize AI
Updates
-
Always excited to partner with Google Cloud—this is a fun one!
The era of "just answering questions" is over. It’s time to build AI that gets things done. 🛠️ Join the Building Agents for Real-World Challenges hackathon! Combine Gemini’s reasoning with exclusive tools from our partners to build autonomous agents that execute and solve real problems. Are you building, or just talking? Let’s see what you’ve got → https://goo.gle/4eLNKQR
-
Excited to partner with GCG for this session on AI observability in production. Join us Tuesday May 26 at 7:00 AM PST to learn how enterprise teams can tackle model degradation, fragmented observability, and evaluation at scale. *Session is in Spanish*
We invite you to a webinar with Arize AI where we will talk about the hidden cost of #AI without observability. We will look at how enterprise teams can monitor, evaluate, and improve models and agents in production, avoiding silent degradation, lack of traceability, and costs that are hard to anticipate. 🚀 We will also share live demos and a real case from a financial services company operating with more than 50M spans per month. 🙌 Join the session here: https://lnkd.in/dmeFU5b8 #AI #EnterpriseAI #AIObservability #LLMOps #MLOps #Arize #GCG
The hidden cost of AI without observability
-
Hot off the presses, Gemini 3.5 Flash is now available in the Prompt Playground and throughout Arize AX! https://app.arize.com
-
The agent framework space has gotten busy fast. Sam Bhagwat of Mastra is joining Observe to talk about what production teams actually need from a TypeScript-first agent stack. If you're a JS/TS shop trying to decide where to anchor your agent code, this conversation will save you a quarter of trial and error. June 4, SF → https://arize.com/observe
-
🛠️ One AI Question with Elizabeth Hutton. We asked our Senior Software Engineer: Why should you learn about evals? Her answer: Complex AI needs more trust, not less. As systems get smarter, evaluations are the only way to verify performance and ensure your AI is actually doing what it's supposed to do. Evals aren't optional; they're the foundation. #AI #AIEvals #LLM
-
LLM-as-a-judge only works in production when the judge knows exactly what it is judging. A fluent answer is not the same as correct system behavior. If a refund agent says “your refund was processed” but never called the refund tool, a “helpful” score is a bad eval. Instead, you should:
- Use code for deterministic checks.
- Use LLM judges for semantic checks.
- Use humans to calibrate edge cases.
- Use traces to explain where the failure came from.
For agents, judging the final answer is not enough. The response may look right while the trajectory is wrong: bad tool choice, hallucinated arguments, ignored tool errors, redundant loops, or unsupported claims. A judge is good when it improves engineering decisions. More here: https://lnkd.in/dhXjtkkc
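As a rough illustration of the "use code for deterministic checks" point, here is a minimal sketch of a trajectory check for the refund example. The `Trace` and `ToolCall` shapes and the tool name `issue_refund` are assumptions made up for this sketch, not a real tracing schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    ok: bool = True  # False if the tool returned an error


@dataclass
class Trace:
    final_answer: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def check_refund_claim(trace: Trace, tool_name: str = "issue_refund") -> dict:
    """Deterministic check: a claimed refund must be backed by a successful tool call.

    `issue_refund` is a hypothetical tool name used only for illustration.
    """
    claims_refund = "refund was processed" in trace.final_answer.lower()
    called_ok = any(c.name == tool_name and c.ok for c in trace.tool_calls)
    passed = (not claims_refund) or called_ok
    reason = "ok" if passed else f"answer claims a refund but the trace has no successful {tool_name} call"
    return {"passed": passed, "reason": reason}


# The answer text is identical in both traces; only the trajectory differs.
bad = Trace("Your refund was processed.", [ToolCall("lookup_order")])
good = Trace("Your refund was processed.",
             [ToolCall("lookup_order"), ToolCall("issue_refund")])
failed_tool = Trace("Your refund was processed.",
                    [ToolCall("issue_refund", ok=False)])

bad_result = check_refund_claim(bad)
good_result = check_refund_claim(good)
failed_result = check_refund_claim(failed_tool)
```

A check like this catches the unsupported claim and the ignored tool error without asking a judge model anything; the LLM judge can then be reserved for semantic questions the code cannot answer.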
-
Your AI agent disagrees with your human reviewers all day. Most teams treat those disagreements as noise. They're the most useful training data in your stack: vendor relationships, deadline pressure, the way your CFO actually thinks. Jim Bennett wrote up how to mine the gap and feed it back to the agent. https://lnkd.in/gMq2Ygmu
-
Docs aren't just for humans anymore. Every coding agent, RAG pipeline, and copilot is reading them too, and they read differently. They truncate, skip pages they can't parse, and trim content before it reaches the model. We built our docs to hold up for every agent that reaches for them. Find us near the top of the Mintlify agent score leaderboard: mintlify.com/score
-
Can you get world-class agents using harness engineering instead of fine-tuning? OpenAI thinks so.
OpenAI is shutting down its fine-tuning APIs. It doesn't mean fine-tuning is dead, but it's a strong signal that fine-tuning isn't what the average AI engineer wants to do. So what are they doing instead?