Measuring What Matters
Notes on evaluating LLMs in the real world
Why I’m starting this
LLMs are getting better fast. But it’s hard to tell how good they really are.
I’ve been working with teams that build and evaluate LLMs. Some of them train the models. Some fine-tune them. Others try to make them more useful, safer, or aligned with a product goal. Everyone wants to know if what they’re doing is actually helping.
Right now, LLM evaluation is scattered. Some teams run benchmarks. Others rely on intuition or product metrics. Each approach is fine on its own, but the mix makes it hard to compare progress across teams. And it makes it hard to trust the results.
This space is where I’ll share what I’ve learned about evaluating LLMs, and what I’m still figuring out.
What to expect
This won’t be a tutorial blog. You won’t see many equations. I’ll write about practical problems that show up when you try to evaluate models in real life.
Some things I plan to cover:
What makes a good eval
Why evals break when models get better
How to balance automatic and human evaluation
What we still don’t understand about model behavior
How to measure usefulness across different product surfaces
Most posts will be short. One idea at a time. No filler.
What kind of community I want
If you work on LLMs, especially if you’re responsible for showing they’re good, I hope this space helps you think more clearly.
If you’re a PM, a researcher, or a manager trying to build reliable systems around models, I think you’ll find this useful.
This will be a space for honest questions, unfinished ideas, and tradeoffs. Not hype. Not answers that pretend to be universal.
Logistics
I’ll post about twice a month.
It’ll be free for now.
If I ever add a paid option, it’ll be for more detailed write-ups, tools, or interviews. But not yet.
Why now
LLMs are moving into every product. Everyone wants them to be helpful, safe, and aligned. But if we can’t measure that well, it’s all just guesswork.
Evaluation isn’t the most glamorous part of ML. But it’s one of the most important. And not enough people talk about it.
So I’m starting.
If that sounds interesting, subscribe below.


