Almost everyone judges an agent the same way. They watch it do the thing once, it works, and they're sold. The demo is charming. It is also the weakest evidence you will ever collect.
Think about what a demo actually is. One input. One run. A human standing right there, ready to forgive a small stumble and round up to "close enough." Production is none of those things. It is thousands of runs you never see, on inputs you didn't pick, with no one waiting to catch the miss. The demo answers "can it do this once." The only question that matters is "how often does it get this right, and how would I know."
You answer that with a test set. Not a vibe. A fixed list of real inputs, each paired with what a good answer looks like — sometimes an exact result, sometimes a clear pass-or-fail rule. You run the agent against the whole list every time you change something, and you count. That is the entire trick. It is closer to a spreadsheet than to science.
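To make that concrete, here is a minimal sketch of a test set and the loop that counts. Everything in it is an assumption made for illustration: the `Case` type, the `exact` and `contains` helpers, the three sample cases, and `toy_agent`, which stands in for whatever function actually calls your agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # the pass-or-fail rule for this case


def exact(expected: str) -> Callable[[str], bool]:
    """Pass only if the output matches the expected result exactly."""
    return lambda output: output.strip() == expected


def contains(fragment: str) -> Callable[[str], bool]:
    """Pass if the output mentions the fragment, ignoring case."""
    return lambda output: fragment.lower() in output.lower()


# Hypothetical cases; a real list would come from real inputs.
TEST_SET = [
    Case("refund-window", "How many days do I have to return an item?", exact("30")),
    Case("escalation", "I want to speak to a manager.", contains("escalat")),
    Case("empty-input", "", contains("rephrase")),
]


def run_test_set(agent: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run the agent on every case and record pass or fail by case name."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption for this sketch)."""
    if "return" in prompt:
        return "30"
    if "manager" in prompt:
        return "Sure, one moment."
    return "Could you rephrase that?"


results = run_test_set(toy_agent, TEST_SET)
passed = sum(results.values())
print(f"{passed}/{len(results)} passed")  # prints "2/3 passed"
```

The toy agent misses the escalation case, and the count says so without anyone having to remember it.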
Here is why the spreadsheet beats your judgment. Without it, "is the agent good" is answered by memory, and memory is generous. You remember the three outputs that delighted you and quietly forget the two that were wrong. A score doesn't flatter anyone. Forty cases, thirty-six right: that is 90 percent, and now you are talking about a number instead of a feeling.
The payoff shows up the moment something changes. A new model drops and you want to switch. With a test set, you swap it, re-run the list, and read the result in minutes. Without one, you are guessing — and you won't find out you guessed wrong until a customer does. The test set is what lets you move fast without moving blind.
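One way to read the result of a swap is case by case, not just as a total. The sketch below assumes you have kept each run as a mapping from case name to pass or fail; the `compare` function and the two sample runs are hypothetical, not output from any real model.

```python
def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """List the cases that changed outcome between two runs of the same test set."""
    shared = baseline.keys() & candidate.keys()
    return {
        "regressed": sorted(n for n in shared if baseline[n] and not candidate[n]),
        "fixed": sorted(n for n in shared if not baseline[n] and candidate[n]),
    }


# Hypothetical results for the old and new model on the same three cases.
old_model = {"refund-window": True, "escalation": False, "empty-input": True}
new_model = {"refund-window": True, "escalation": True, "empty-input": False}

diff = compare(old_model, new_model)
print(diff)  # prints {'regressed': ['empty-input'], 'fixed': ['escalation']}
```

Both runs score two out of three here, so the total alone would hide that one case broke while another got fixed.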
Building one is not glamorous, which is exactly why most teams skip it. Collect twenty to fifty real cases — the boring ones and the nasty edge ones. Write down what "good" means for each. Re-run them on every change to the prompt, the model, or the tools. That's it. The discipline is the product.
This is also where taste stops being a mood and becomes a measurement. You decide once what a good answer looks like, you write it down, and then every version of the system has to clear that bar. The standard stops living in one person's head and starts living in the test. That is how taste scales without getting watered down.
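Written down, the bar can be as small as one function. This is a sketch under assumptions: the name `clears_bar`, the idea of a minimum pass rate, and the set of must-pass cases are all choices made here for illustration, and the thresholds are placeholders you would set yourself.

```python
def clears_bar(results: dict[str, bool], min_pass_rate: float, must_pass: set[str]) -> bool:
    """Return True only if the run meets the pass rate and every critical case passes."""
    if not results:
        return False  # an empty run proves nothing
    pass_rate = sum(results.values()) / len(results)
    critical_ok = all(results.get(name, False) for name in must_pass)
    return pass_rate >= min_pass_rate and critical_ok


# Hypothetical run: forty cases, thirty-six right (cases 0 to 3 fail).
run = {f"case-{i}": i >= 4 for i in range(40)}

ok_default = clears_bar(run, min_pass_rate=0.85, must_pass={"case-10"})
ok_strict = clears_bar(run, min_pass_rate=0.95, must_pass=set())
ok_critical = clears_bar(run, min_pass_rate=0.85, must_pass={"case-0"})
print(ok_default, ok_strict, ok_critical)  # prints "True False False"
```

The same run passes or fails depending on where the bar sits, which is the point: the standard is an explicit setting anyone can read, not an opinion.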
So before you trust an agent, ask the unglamorous question. Not "did it impress me," but "what's the score, and on which cases does it fail." If nobody can answer, you don't have a system yet. You have a demo.
Takeaway: a demo is a feeling; a test set is a fact. Measure the agent on real cases before you trust it with real work. ✱
