When Can You Trust an LLM to Judge Another?
A look at when LLM-as-a-judge works and when it doesn’t
You’ve probably seen evaluation setups where one model rates the answers of other models. This is called “LLM-as-a-judge.” It’s fast. It scales. And it doesn’t need humans in the loop.
But is it reliable?
Why Use LLM-as-a-Judge?
The main reason is speed. Human evaluations take time and money. If you’re comparing a few outputs, that’s fine. But if you’re running model tests every day or comparing thousands of prompts, you need something faster.
LLMs are also good at spotting simple issues like formatting mistakes, missed instructions, or irrelevant answers. On these kinds of checks, their ratings are often close to human judgments.
That sounds great. But there are some caveats.
Common Problems
Here are some reasons why LLM-as-a-judge can go wrong:
1. Positional bias
Many setups compare two answers and ask the judge model to pick one. The judge often favors the response in the first position. Some papers try to correct this by randomizing order, but the bias still shows up in results. A simple check is to run every comparison in both orders and see how often the verdict flips (see the sketch at the end of this list).
2. Self-preference
If the judge model is from the same family as one of the candidates, it often favors that model. For example, Claude tends to prefer Claude, and GPT tends to prefer GPT. This makes sense: models judge others through their own lens. But it also means the results can be skewed.
3. Overconfidence
Judging tasks that require reasoning or domain knowledge is tricky. The judge might give a confident score to an incorrect answer. If both answers look fluent, the model may pick the one that “sounds better” over the one that is factually correct.
4. Style over substance
LLMs are sensitive to phrasing. A concise answer might be better, but the model may prefer one that explains more or uses familiar patterns. That’s not always bad, but it can hide real differences in quality.
5. Poor correlation on hard tasks
Studies from Anthropic, DeepMind, and others show that on simple questions, LLM judges often agree with human raters. But on complex or nuanced tasks, the correlation drops. This includes open-ended reasoning, multi-hop QA, and moral judgment.
6. Replication issues
Results often vary between runs. If the judge model is sampled with temperature greater than zero, its rankings might change slightly each time. Even with deterministic settings, API updates or prompt tweaks can shift outcomes. If your results aren’t stable, it’s hard to compare models over time or debug regressions.
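To make the positional-bias and replication problems above concrete, here is a minimal sketch of a consistency check: it asks the judge twice with the candidate order swapped and only accepts a verdict when both orderings agree. The `call_judge`-style function (`JudgeFn`) is a hypothetical placeholder for whatever judge model or API you actually use; it is assumed to return "A" or "B".

```python
from typing import Callable, Optional

# Hypothetical judge call: given a question and two answers shown as "A" and "B",
# it should return "A" or "B". Wire this up to whichever judge model/API you use.
JudgeFn = Callable[[str, str, str], str]

def judge_both_orders(judge: JudgeFn, question: str,
                      ans_1: str, ans_2: str) -> Optional[str]:
    """Ask the judge twice with positions swapped; accept a winner only if
    both orderings agree, otherwise return None (a position-dependent verdict)."""
    first = judge(question, ans_1, ans_2)   # ans_1 shown in position "A"
    second = judge(question, ans_2, ans_1)  # ans_1 shown in position "B"
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None

def positional_flip_rate(judge: JudgeFn,
                         pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of (question, answer_1, answer_2) pairs where the verdict
    changes when the order is swapped -- a rough measure of positional bias."""
    verdicts = [judge_both_orders(judge, q, a1, a2) for q, a1, a2 in pairs]
    return sum(v is None for v in verdicts) / len(verdicts) if verdicts else 0.0
```

The same two-call pattern doubles as a cheap reproducibility check: if the judge disagrees with itself across orderings or repeated runs, its single-run verdicts don't mean much.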
How to Make It More Reliable
Here are some ways teams try to improve trust in LLM-as-a-judge:
Use strong models: GPT-4 and Claude 3 show better judging ability than smaller models. But even they are not perfect.
Avoid close relatives: Don’t use the same model family to judge itself.
Use multiple judges: Averaging scores across several judge models reduces the impact of any single model's bias (see the sketch after this list).
Balance position effects: Randomize candidate order and repeat the eval both ways.
Fine-tune a judge model: Some teams train custom judge models with human-labeled data. This improves alignment with human preferences.
Spot-check with humans: Always sample a few outputs for manual review. If human and LLM ratings disagree often, don’t trust the automated scores.
Control for randomness: Use low temperature or deterministic decoding. Run multiple passes to test reproducibility.
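As a rough illustration of the last few points (multiple judges, deterministic decoding, repeated passes), here is a minimal sketch that averages scores across several judge models and reports the spread between runs. The `score_fn` callable is a hypothetical wrapper around whatever provider you use; it is assumed to return a numeric score for one answer with temperature set to 0.

```python
import statistics
from typing import Callable

# Hypothetical scoring call: (judge_model_name, question, answer) -> score (e.g. 1-10).
# Assumed to use temperature 0 / deterministic decoding under the hood.
ScoreFn = Callable[[str, str, str], float]

def aggregate_score(score_fn: ScoreFn, judges: list[str],
                    question: str, answer: str, passes: int = 3) -> dict:
    """Average one answer's score across several judge models and repeated
    passes, and report the spread so unstable verdicts are easy to spot."""
    per_judge = {
        judge: [score_fn(judge, question, answer) for _ in range(passes)]
        for judge in judges
    }
    all_scores = [s for runs in per_judge.values() for s in runs]
    return {
        "mean": statistics.mean(all_scores),
        "stdev": statistics.pstdev(all_scores),  # large spread = poor reproducibility
        "per_judge_mean": {j: statistics.mean(r) for j, r in per_judge.items()},
    }
```

If the per-judge means disagree sharply, or the standard deviation is large relative to the score scale, treat that as a signal to fall back to human review rather than trust the average.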
So When Should You Use It?
LLM-as-a-judge is useful when:
You want fast feedback at scale
You’re comparing models on general tasks
You’ve calibrated the judge setup and checked for bias
Avoid relying on it when:
The task needs factual accuracy or domain expertise
The outputs are close in quality
The model being judged is from the same family as the judge
Use it as one signal, not the final decision.