Dasha Maliugina

Posted on May 14

💡 10 learnings on LLM evaluations

#llm #ai #genai #machinelearning

If you are building with LLMs, you already know this: evaluation is hard. And it's nothing like testing regular software. You can’t just throw in some unit tests and call it a day.

Here are 10 key ideas on LLM evaluations, sourced from LLM evaluation for builders: applied course (It’s a free and open course with YouTube videos and code examples).

1. LLM evaluation ≠ Benchmarking

Evaluating LLM systems is not the same as evaluating LLM models. Public LLM benchmarks test general model capabilities, like math, code, or reasoning. But if you're building an LLM application, you need to evaluate it on your specific use cases.

For example, for a customer support chatbot, you need to test how well it grounds the answers on the company’s policies or handles tricky queries like comparing your product against competitors. It’s not just how it performs on MMLU.

2. Evaluation is a tool, not a task

LLM evaluation isn’t just a checkbox. It is a tool that helps answer product questions and supports decision-making throughout the product lifecycle:

During the experiments, evals help you compare different prompts, models, and settings to determine what works best.
Before deployment, you can run stress-testing and red-teaming to check how your system handles edge cases.
Once in production, you need to monitor how well your product is doing in the wild.
When you make changes, you need to run regression tests before shipping the updates.

So, to design evaluations well, you first need to figure out what you are looking to solve!

3. LLM evaluation ≠ Software testing

Unlike traditional software, LLM systems are non-deterministic. That means that the same input can yield different outputs. Also, LLM products often solve open-ended tasks, like writing an email, that do not have a single correct answer.

In addition, LLM systems bring a whole new set of risks. They can hallucinate and confidently make things up. Malicious users can attempt to jailbreak LLM apps and bypass their security measures. LLM app data leaks can result in exposing sensitive data.

Testing for functionality is no longer enough. You must also evaluate the quality and safety of your LLM system’s responses.

4. Combine manual and automated evaluations

You should always start with manual review to build intuition. Your goal is to understand what “good” means for your use case and spot patterns: are there any common failure modes or unexpected behavior?

Once you know what you’re looking for, you can add automation. But the important thing is that automated LLM evals are here to scale human judgment, not replace it.

5. Use both reference-based and reference-free evals

There are two main types of LLM evaluation methods:

Reference-based evals compare your system’s outputs to expected – or “ground-truth” – answers, which is great for regression testing or experiments. You can use such methods as exact match, semantic similarity, BERTScore, and LLM-as-a-judge.

Reference-free evals allow assessing specific qualities of the response, like helpfulness or tone. These are useful in open-ended scenarios and production monitoring. You can use text statistics, regular expressions, ML models, and LLM judges.

As you’ve probably guessed, you’ll need both types!

6. Think in datasets, not unit tests

Traditional testing is built around unit tests. With LLMs, it’s more useful to think in datasets. You need to test for a range of acceptable behaviors, so it’s not enough to run evaluations on a single example.

You may need to create diverse test sets, including happy paths, edge cases, and adversarial examples. Good evaluation datasets reflect both how users interact with your app in the real world and where things can go wrong.

7. LLM-as-a-judge is a key evaluation method

LLM-as-a-judge is a common technique to evaluate LLM-powered products. The idea is simple: you can use another LLM (or the same one!) to evaluate your system’s response with a custom evaluation prompt. For example, you can ask the judge whether your chatbot’s response is polite or aligns with the brand image.

This approach is scalable and surprisingly effective. Just remember, LLM judges aren’t perfect. You’ll need to assess and tune them, and invest in designing the evaluation criteria.

8. Use custom criteria, not just generic metrics

Since LLMs often solve very custom tasks, you must design quality criteria that map to your use case. You can’t just blindly use metrics like coherence or helpfulness without critically thinking about what actually matters.

Instead, you must define what “good” means for your app. Then, customize your evaluation to your domain, users, and specific risks. For a legal assistant, you can check whether the answer cites the correct regulations. For a wellness chatbot, you may need to ensure it answers in a friendly manner and does not provide medical advice.

9. Start with analytics

Evaluations are very analytical tasks.

To run LLM evals, you first need the data. So log everything: capture all the inputs and outputs, record metadata like model version and prompt, and track user feedback if you have it. If your app doesn’t have real users yet, you can start with synthetic data and grow from there.

You also need to manually analyze the outputs you get to determine your criteria and understand the failure modes you observe.

10. Evaluation is a moat

Building a solid LLM evaluation system is an investment. But it is also a competitive advantage:

Rapid iteration. Evals help speed up your AI product development cycle, ship updates stress-free, switch models easily, and debug issues efficiently.
Safe deployment. Evaluations allow you to test how an LLM system handles edge cases to avoid liability risks and protect customers from harmful outputs.
Product quality at scale. Finally, evals help ensure your LLM app works well and provides a good customer experience.

Your competitors can’t copy that — even if they use the same model for their LLM app!

🔥 Free course on LLM evaluations

Learn how to create LLM judges, evaluate RAG systems, and run adversarial tests. The course is designed for AI/ML Engineers and those building real-world LLM apps; basic Python skills are required. And yes, it’s free! Learn more and sign up.

DEV Community