How to Collect Human Preference Data and Train a Reward Model That’s Actually Useful
You don’t need a huge dataset or perfect labels, just the right structure
Human feedback is noisy and slow. Reward modeling sounds complex. But you can still build something useful with just a few thousand examples, if you structure things well.
This post covers how to collect lightweight human preference data and how to turn it into a working reward model.
Part 1: Getting the Feedback
Start with thumbs (if you have them)
If your product already collects thumbs up or down, use that. It’s free data. It’s noisy, but you can still use it for filtering, sampling, or tuning metrics.
Make sure you can tie it back to the model and prompt that generated the response.
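If you do log thumbs events, a flat record with enough context to trace the rating back to its source is all you need. A minimal sketch (the field names here are illustrative, not a required schema):

# One thumbs event, logged alongside the response it rates.
thumbs_event = {
    "prompt": "Summarize this support ticket: ...",
    "response": "...",
    "rating": "down",                    # "up" or "down"
    "model": "assistant-v2.3",           # which model produced the response
    "prompt_template": "ticket_summary_v1",
    "timestamp": "...",
}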
Run small, structured preference tasks
If you need better signal, run side-by-side comparisons. Show two model outputs and ask, “Which is better?”
This is simple and works across many tasks. You can use internal reviewers or a vendor team. You don’t need thousands of examples to get started.
To make the data useful:
Randomize Model A / Model B order
Write clear rater instructions
Label task type (e.g. summarization, QA, chat)
Track ties or “neither” votes if needed
Even a few hundred comparisons can help you pick between two model variants.
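One point from the list above deserves code: randomize which side each model appears on and record the mapping internally, so you can recover which model actually won. A minimal sketch (function and field names are illustrative):

import random

def make_comparison_task(prompt, output_x, output_y):
    # Randomize which model shows up as "A" so raters can't learn a side bias.
    flipped = random.random() < 0.5
    response_a, response_b = (output_y, output_x) if flipped else (output_x, output_y)
    return {
        "prompt": prompt,
        "response_a": response_a,
        "response_b": response_b,
        # Keep the mapping server-side; never show it to the rater.
        "a_is": "model_y" if flipped else "model_x",
        "b_is": "model_x" if flipped else "model_y",
    }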
Sample where it matters
Focus on areas where:
Model win rates are close
Offline evals don’t match product metrics
You’re optimizing something subjective (like tone)
It’s okay to ignore parts of the output space where you already have confidence.
Store it well
Once you collect feedback, store it in a structured format:
{
  "prompt": "...",
  "response_a": "...",
  "response_b": "...",
  "winner": "a",
  "task_type": "summarization",
  "annotator": "vendor_batch_3"
}
This will let you filter, split, and fine-tune later.
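If each record goes into a JSONL file (one JSON object per line), filtering and splitting later stays simple. A minimal sketch, assuming the schema above and a hypothetical preferences.jsonl file:

import json

def load_comparisons(path, task_type=None):
    # Read preference records, optionally keeping only one task type.
    records = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if task_type is None or rec.get("task_type") == task_type:
                records.append(rec)
    return records

records = load_comparisons("preferences.jsonl", task_type="summarization")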
Part 2: Training the Reward Model
Now that you have human preferences, you can train a model to predict them. This becomes your reward model.
You don’t need to go big. You can train a simple head on top of an open base model. Even a small model can help you rank responses or debug regressions.
What data to use
Use only the comparisons where the rater picked A or B clearly. You can drop ties for now.
Each example becomes a pair: “A was preferred to B.” You train the model to score A higher than B.
You’ll want:
A few thousand examples minimum
A consistent format
A held-out set for eval (not seen during training)
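To get there from the stored comparisons, drop the ties, map each record to a prompt / chosen / rejected triple, and hold out a slice for eval. A sketch, assuming the JSONL schema from Part 1 (records here are the dicts loaded in that earlier sketch):

import random

def build_pairs(records, eval_fraction=0.1, seed=0):
    # Convert raw comparisons into chosen/rejected pairs, then split off an eval set.
    pairs = []
    for rec in records:
        if rec["winner"] not in ("a", "b"):   # drop ties / "neither" votes for now
            continue
        chosen = rec["response_a"] if rec["winner"] == "a" else rec["response_b"]
        rejected = rec["response_b"] if rec["winner"] == "a" else rec["response_a"]
        pairs.append({"prompt": rec["prompt"], "chosen": chosen, "rejected": rejected})
    random.Random(seed).shuffle(pairs)
    n_eval = int(len(pairs) * eval_fraction)
    return pairs[n_eval:], pairs[:n_eval]     # train split, eval split

train_pairs, eval_pairs = build_pairs(records)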
How to train it
There are two standard options:
Pairwise loss (Bradley-Terry-style)
You feed both responses through the model and train it to score the preferred one higher.
Direct Preference Optimization (DPO)
You keep a frozen copy of the base model as a reference and fine-tune the policy directly on the preference pairs. This skips learning an explicit reward model but still uses the same data.
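In code, the Bradley-Terry option is one line of loss: maximize the log-sigmoid of the score gap between the preferred and rejected responses. A minimal sketch using a single-logit classification head as the reward head (the base model name and max length are placeholders):

import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilroberta-base"   # placeholder; any open base model with a scalar head works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

def pairwise_loss(prompt, chosen, rejected):
    # Score both responses and push the chosen one above the rejected one.
    batch = tokenizer(
        [prompt + "\n" + chosen, prompt + "\n" + rejected],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    )
    scores = model(**batch).logits.squeeze(-1)    # shape (2,)
    return -F.logsigmoid(scores[0] - scores[1])   # -log sigmoid(s_chosen - s_rejected)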
You can start with a simple pairwise loss using a transformer head. There are open-source libraries like TRL that help with this.
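TRL's RewardTrainer wraps this loop for you: it consumes chosen/rejected pairs and trains a scalar-head model with the same pairwise objective. The exact arguments have shifted between TRL releases, so treat this as a sketch and check the docs for the version you install:

from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "distilroberta-base"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# train_pairs: the {"prompt", "chosen", "rejected"} dicts built earlier.
train_ds = Dataset.from_list(
    [{"chosen": p["prompt"] + "\n" + p["chosen"],
      "rejected": p["prompt"] + "\n" + p["rejected"]} for p in train_pairs]
)

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model", num_train_epochs=1),
    train_dataset=train_ds,
    processing_class=tokenizer,   # older TRL versions name this argument `tokenizer`
)
trainer.train()

TRL also ships a DPOTrainer if you go the DPO route instead.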
How to use it
Once trained, your reward model can:
Rank model outputs automatically
Filter poor generations
Score outputs during training or eval
Provide feedback to new models (e.g., via RLHF or DPO)
Even if it’s not perfect, a small reward model gives you a reusable signal. Over time, you can keep fine-tuning it with more data.
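As a concrete example, ranking or filtering generations is just a batched forward pass over the candidates. A sketch, reusing the scalar-head model and tokenizer from the training snippet:

import torch

def rank_by_reward(prompt, candidates, model, tokenizer):
    # Score each candidate with the reward model and return them best-first.
    batch = tokenizer(
        [prompt + "\n" + c for c in candidates],
        padding=True, truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        scores = model(**batch).logits.squeeze(-1)
    order = torch.argsort(scores, descending=True).tolist()
    return [candidates[i] for i in order]

# best_of_n = rank_by_reward(prompt, sampled_outputs, model, tokenizer)[0]

The commented line is the best-of-n pattern: sample several outputs from your generation model and keep the highest-scoring one.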
Final thoughts
You don’t need a massive human data pipeline to start using preferences. A few thousand (or even a few hundred) well-structured comparisons can already improve your decision-making.
And once you’ve got that data, you can train a simple reward model that helps you move faster, debug smarter, and test more ideas.