How to Evaluate Agents and Agentic Workflows
Why normal LLM evals are not enough
Evaluating a single model is hard. Evaluating an agent is harder. An agent is not just one model call. It plans, calls tools, makes decisions, reads context, and runs over time. You cannot judge it with simple win rates or static benchmarks. You need to test how it behaves.
This post covers how to think about evaluating agents.
What makes agents different
Agents are not single-turn systems. They have memory, state, and tools. They may call APIs, run code, search documents, or trigger actions. This means their output depends on a sequence of steps. A good agent may need ten steps. A bad one may loop forever.
So you cannot only evaluate the final answer. You also need to evaluate the process.
What to measure
You need new metrics for agents. Common ones:
Task success
Did the agent finish the task? This is the most basic check. It can be a simple yes or no.
Step quality
Did each step make progress? Was the plan logical? Did the tool calls make sense? This matters when the final output is wrong but most of the reasoning was still sound.
Efficiency
How many steps did it take? Every unnecessary step adds latency and cost.
Failure types
When it fails, how does it fail? Loops, hallucinations, tool misuse, forgotten instructions. These patterns matter.
Safety
Does it avoid harmful or risky actions? Safety checks must run inside the loop, not just at the end.
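As a concrete starting point, here is a minimal sketch of a per-run metrics record. The field names and failure labels are placeholders for illustration, not a standard schema; adapt them to your own harness.

```python
# Minimal per-run metrics record for an agent evaluation.
# Field names and failure labels are illustrative placeholders.
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentRunMetrics:
    task_id: str
    success: bool                        # task success: did the agent finish?
    steps: int                           # efficiency: how many steps it took
    failure_type: Optional[str] = None   # e.g. "loop", "tool_misuse", "hallucination"
    safety_flags: list[str] = field(default_factory=list)  # raised inside the loop

def summarize(runs: list[AgentRunMetrics]) -> dict:
    """Aggregate per-run records into the headline numbers."""
    total = max(len(runs), 1)  # avoid division by zero on an empty set
    return {
        "success_rate": sum(r.success for r in runs) / total,
        "avg_steps": sum(r.steps for r in runs) / total,
        "failure_counts": dict(Counter(r.failure_type for r in runs if r.failure_type)),
        "runs_with_safety_flags": sum(1 for r in runs if r.safety_flags),
    }
```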
How to run evaluations
Static benchmarks do not work well for agents. You need scenario-based evaluation. That means giving the agent a goal and letting it run inside a sandbox. Then you check:
Did it reach the goal
Did it follow constraints
Did it use tools correctly
Did it avoid harmful actions
Example prompt:
Goal: Plan a three day trip to Tokyo.
Constraints: Stay under 800 dollars. Include travel time. Output a daily schedule.

You then log every step and score the full trace.
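Here is a rough sketch of such a harness. It assumes you can drive the agent one step at a time through a callable (named agent_step here) and that each step comes back as a small dict; the names and trace format are illustrative, not any specific framework's API.

```python
# Sketch of a scenario harness: give the agent a goal, let it run in a sandbox,
# and log every step so the full trace can be scored afterwards.
# `agent_step` and the step dict keys ("done", etc.) are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    goal: str
    constraints: list[str]
    max_steps: int = 20   # hard cap so a looping agent cannot run forever

def run_scenario(agent_step: Callable[[str, list], dict], scenario: Scenario) -> dict:
    trace = []
    for _ in range(scenario.max_steps):
        step = agent_step(scenario.goal, trace)  # one planning or tool step
        trace.append(step)
        if step.get("done"):
            break
    return {
        "goal": scenario.goal,
        "constraints": scenario.constraints,
        "trace": trace,  # every step is logged, not just the final answer
        "reached_goal": bool(trace) and bool(trace[-1].get("done")),
        "steps_used": len(trace),
    }

tokyo_trip = Scenario(
    goal="Plan a three day trip to Tokyo.",
    constraints=["Stay under 800 dollars.", "Include travel time.", "Output a daily schedule."],
)
```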
Automatic vs manual judging
Automatic evals are useful for quick checks. You can use LLM-as-a-judge to score final results or inspect reasoning traces. But you should not rely only on them. Agents interact with tools and environments. Judges can miss errors that only humans catch, like silent planning failures or ignored constraints.
A practical setup:
Automatic scoring for basic metrics like success rate and step count
LLM-as-a-judge for quality checks
Human audits for safety and complex tasks
This balances speed and trust.
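A sketch of that layered setup might look like the following. It assumes a call_llm helper that sends a prompt and returns the judge's JSON reply; the prompt wording, score scale, and audit rule are placeholders to adapt.

```python
# Layered judging sketch: deterministic metrics from the trace, an LLM judge for
# quality, and a flag that routes risky or low-quality runs to a human audit.
# `call_llm` is an assumed helper; the prompt and thresholds are illustrative.
import json

JUDGE_PROMPT = """You are grading an agent run.
Goal: {goal}
Constraints: {constraints}
Trace: {trace}
Return JSON with keys: constraints_followed (bool), plan_quality (1-5), notes (str)."""

def judge_run(call_llm, result: dict) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(
        goal=result["goal"],
        constraints=result["constraints"],
        trace=result["trace"],
    ))
    verdict = json.loads(raw)
    return {
        # Automatic scoring: read straight from the logged trace.
        "reached_goal": result["reached_goal"],
        "steps_used": result["steps_used"],
        # LLM-as-a-judge: quality checks on the plan and constraints.
        "constraints_followed": verdict["constraints_followed"],
        "plan_quality": verdict["plan_quality"],
        # Human audit: anything risky or low quality goes to a person.
        "needs_human_audit": not verdict["constraints_followed"] or verdict["plan_quality"] <= 2,
    }
```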
Regression tests
When you update your agent, you must make sure it still solves past tasks. Keep a replay set of scenarios. Run them daily or weekly. Compare the results against past runs. If the success rate drops, you have caught a regression early.
This is the most useful tool for agent stability.
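A small replay check is enough to start with. The sketch below compares today's per-task results against a stored baseline; the JSON file layout and the tolerance are assumptions, not a fixed format.

```python
# Replay-set regression check: compare current per-task results against a baseline.
# The baseline file maps task_id -> bool (did it succeed on the last accepted run).
import json

def regression_check(baseline_path: str, current: dict[str, bool],
                     tolerance: float = 0.02) -> bool:
    with open(baseline_path) as f:
        baseline: dict[str, bool] = json.load(f)

    shared = sorted(baseline.keys() & current.keys())
    if not shared:
        return True  # nothing to compare yet

    old_rate = sum(baseline[t] for t in shared) / len(shared)
    new_rate = sum(current[t] for t in shared) / len(shared)
    newly_failing = [t for t in shared if baseline[t] and not current[t]]

    if new_rate < old_rate - tolerance:
        print(f"Regression: success rate fell from {old_rate:.0%} to {new_rate:.0%}")
        print("Newly failing tasks:", newly_failing)
        return False
    return True
```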
Failure analysis
Do not stop at pass or fail. Look for patterns:
Does the agent forget instructions
Does it ignore errors from tools
Does it loop when it gets stuck
Does it stop too early
Does it overuse or underuse tools
Many agent problems come from planning flaws. Eval should expose those, not hide them.
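Some of these patterns can be flagged automatically with simple trace heuristics, so humans only review the ambiguous cases. The sketch below assumes the step-dict trace format from the harness example above; the keys and tag names are illustrative.

```python
# Coarse failure tagging over a logged trace. Keys like "action", "args",
# "tool_error", and "done" are assumptions matching the harness sketch above.
def tag_failures(trace: list[dict]) -> list[str]:
    tags = []

    # Loop: the same action with the same arguments repeated back to back.
    if any(prev.get("action") == cur.get("action") and prev.get("args") == cur.get("args")
           for prev, cur in zip(trace, trace[1:])):
        tags.append("loop")

    # Ignored tool errors: a tool reported an error and the agent never reacted.
    if any(step.get("tool_error") and not step.get("handled_error") for step in trace):
        tags.append("ignored_tool_error")

    # Stopped too early: the run ended without any step marking the task done.
    if trace and not any(step.get("done") for step in trace):
        tags.append("stopped_without_finishing")

    # Underused tools: no tool call at all on a task that expects them.
    if not any(step.get("action") == "tool_call" for step in trace):
        tags.append("no_tool_use")

    return tags
```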
Final thoughts
Agent evaluation is not about a single score. It is about behavior. You need to log every step, replay old runs, measure success, trace failures, and watch for drift. If you only look at the final answer, you will miss most of the problems.
This is slower than single-model evals. But it is necessary. Unchecked agents do strange things. Good eval keeps them honest.

