Benchmarking Your LLM Application in Production
Making offline evals match what happens in production
Most teams have a benchmark set. They use it to test new models and track progress. The problem is that many of these benchmarks don't actually predict what happens at launch. A model that wins on the benchmark may not move user metrics at all.
So how do you build a benchmark that matters?
Start from your product goals
The first step is to connect benchmarks to what you care about in production. If your product is a search engine, you should measure relevance. If it’s a summarizer, you should measure accuracy, coverage, and style. If you’re building a safety system, you should measure hallucination, bias, and harmful output.
A generic benchmark like “helpfulness” is rarely enough. You need benchmarks that match the outcomes you want users to see. Otherwise you’ll be optimizing in the wrong direction.
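One way to keep that alignment honest is to pin the product-to-metric mapping down in code, so every eval run reports the same outcomes. Here is a minimal sketch; the metric names are placeholders, not a standard taxonomy:

```python
# Hypothetical mapping from product type to the metrics each eval run
# must report. The names here are illustrative, not a standard.
PRODUCT_METRICS = {
    "search": ["relevance_at_10"],
    "summarizer": ["factual_accuracy", "coverage", "style_match"],
    "safety_filter": ["hallucination_rate", "bias_score", "harmful_output_rate"],
}

def metrics_for(product: str) -> list[str]:
    """Look up which metrics a given product's benchmark should score."""
    return PRODUCT_METRICS[product]
```

The point is less the dictionary itself than the discipline: if a metric isn't in this mapping, it shouldn't decide a launch.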
Use real prompts, not just synthetic
Synthetic prompts are easy to generate. But they don’t capture the messiness of real user behavior. Users ask ambiguous, vague, and sometimes strange questions. If your benchmark ignores this, it won’t tell you much about production performance.
The best benchmarks include real prompts from logs or pilot runs. Clean them, anonymize them, and tag them by type. Even a small set of real data is more valuable than a large synthetic set that looks nothing like reality.
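As a rough sketch of that pipeline, suppose your logs are JSONL with a `prompt` field (an assumption, not a reference to any particular logging schema). You could scrub obvious PII and tag each prompt by type like this:

```python
import json
import re

# Hypothetical sketch: pull real prompts from a JSONL log dump, scrub
# obvious PII, and tag each prompt by type. The field name "prompt" and
# the regexes are assumptions; a real pipeline needs a stronger anonymizer.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text: str) -> str:
    """Replace obvious PII with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

def tag_prompt(text: str) -> str:
    """Crude keyword-based tagging; swap in a classifier if you have one."""
    lowered = text.lower().strip()
    if "summarize" in lowered or "tl;dr" in lowered:
        return "summarization"
    if lowered.endswith("?"):
        return "question"
    return "other"

def build_benchmark_rows(log_path: str) -> list[dict]:
    """Turn raw log lines into cleaned, tagged benchmark rows."""
    rows = []
    with open(log_path) as f:
        for line in f:
            prompt = anonymize(json.loads(line)["prompt"].strip())
            rows.append({"prompt": prompt, "tag": tag_prompt(prompt)})
    return rows
```

Even this crude version beats pasting raw logs into a spreadsheet: the tags give you the handle you'll need later for balancing coverage.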
Balance coverage
A good benchmark has range. If all your prompts are easy, the test won’t show real differences. If they all come from one domain, you’ll miss blind spots.
Include a mix of task types, for example fact recall, reasoning, open-ended writing, and multi-turn dialogue. Include a mix of difficulty too: some prompts should be trivial, others genuinely hard. This lets you see where models improve and where they still fail.
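If your prompts carry tags, you can enforce that mix mechanically. Here is a minimal sketch using stratified sampling; the `task_type` and `difficulty` fields and the per-cell quota are assumptions you would tune to your product:

```python
import random
from collections import defaultdict

def stratified_sample(rows: list[dict], per_cell: int, seed: int = 0) -> list[dict]:
    """Take up to `per_cell` prompts from each (task_type, difficulty) bucket,
    so no single domain or difficulty level dominates the benchmark."""
    rng = random.Random(seed)
    buckets: dict[tuple, list[dict]] = defaultdict(list)
    for row in rows:
        buckets[(row["task_type"], row["difficulty"])].append(row)
    sample = []
    for _, bucket in sorted(buckets.items()):
        rng.shuffle(bucket)
        sample.extend(bucket[:per_cell])
    return sample

# Illustrative rows; in practice these come from your tagged log prompts.
rows = [
    {"prompt": "What year did Apollo 11 land?", "task_type": "fact_recall", "difficulty": "easy"},
    {"prompt": "Plan a 3-course menu under $20.", "task_type": "reasoning", "difficulty": "hard"},
]
benchmark = stratified_sample(rows, per_cell=1)
```

The fixed seed matters: you want the same sample every run, so score changes come from the model, not the sampler.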
Keep it stable, but refresh when needed
You need stability to track trends over time. If you change the benchmark every month, you can’t compare scores. But you also can’t freeze it forever. As users change, your benchmark should adapt.
One way is to keep a “golden test set” of prompts that never change, plus a rotating slice that refreshes every quarter. This gives you both consistency and coverage of new behaviors.
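A minimal sketch of that split, assuming you key the rotating slice by calendar quarter (the file layout and function names are hypothetical):

```python
import datetime

def current_quarter(today: datetime.date | None = None) -> str:
    """Return a key like '2024-Q3' for selecting the rotating slice."""
    today = today or datetime.date.today()
    return f"{today.year}-Q{(today.month - 1) // 3 + 1}"

def load_benchmark(golden: list[dict],
                   rotating_by_quarter: dict[str, list[dict]]) -> list[dict]:
    """Golden prompts never change; the rotating slice swaps each quarter."""
    return golden + rotating_by_quarter.get(current_quarter(), [])
```

Score the two slices separately: the golden set gives you the trend line, the rotating slice tells you whether the model keeps up with new behavior.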
Check correlation with production
The real test of a benchmark is whether it predicts product outcomes. After a model launch, compare the benchmark results with real user metrics. Did the model that won on the benchmark also improve engagement, satisfaction, or safety?
If the answer is yes, you’re on the right track. If not, adjust the dataset. Benchmarks are not one-and-done. They should evolve until they line up with reality.
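Concretely, you can track this with a simple correlation across past launches. The sketch below uses Pearson's r from the standard library (Python 3.10+); the numbers are made up for illustration:

```python
from statistics import correlation

# Each pair is (offline benchmark score, production metric delta) for one
# past launch. These values are fabricated to illustrate the check.
launches = [
    (0.62, 0.41),
    (0.71, 0.48),
    (0.69, 0.39),
    (0.80, 0.55),
]
bench_scores = [b for b, _ in launches]
prod_metrics = [p for _, p in launches]

r = correlation(bench_scores, prod_metrics)  # Pearson's r
print(f"benchmark vs. production correlation: {r:.2f}")
# A low or negative r is a signal to revisit the dataset, not to ship more models.
```

With only a handful of launches the estimate is noisy, so treat it as a directional check rather than a precise statistic.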
Build for iteration
Think of benchmarks as living tools, not static tests. They should evolve with your product and your users. They should also be cheap to run and easy to extend: if it takes weeks to add a new prompt or re-run the set, you won't keep it fresh. The same goes for any human-in-the-loop review step; keep it lightweight rather than over-engineered.
A small, well-designed benchmark that is easy to maintain will always beat a giant benchmark that nobody trusts.
Final thoughts
A good benchmark is not about size. It’s about alignment. It should reflect your goals, use real prompts, cover a balanced range of tasks, stay stable but refresh over time, and always be checked against production outcomes.
If you build it that way, your offline evals will stop being an academic exercise. They’ll actually tell you what matters for your users.

