Two people sit with the same model, the same access, the same company data behind it. One walks away thrilled. One walks away convinced the thing is useless. I keep watching this happen, and the difference has little to do with how they phrase the first prompt.
The one who gets good work out of it can look at an answer and tell, fast, whether it is any good. That instinct has a name. We call it taste. And taste turns out to be the same thing I spend my days trying to write down for machines.
The Pattern I See
Watch two people work the same tool on the same task. Both follow the usual prompt advice: be specific and give the model context.
Pippo gets a generic, mediocre answer and accepts it. He does not register that anything is off. He moves on.
Saro gets a near-identical answer, frowns, and asks for a rewrite with more detail. The second pass is better. He pushes once more and lands on something genuinely good.
Same starting prompt. Same tool. Different results.
The gap sits in one place: Saro can look at an output and say "this is not right yet," then point at what to fix. Call it judgment about quality, the ability to feel that an answer is thin before you can fully argue why. Prompt tricks have little to do with it.
[Demo: the prompt "write the launch announcement for our new feature," run twice with the same tool and the same access. The only thing that changes between the two runs is judgment.]
What I Mean by Taste
Taste here is the act of judging quality. You know when an answer is good rather than merely okay. The flat hum of generic AI prose reaches you, and so does the sense of which way to push and when to stop. You feel all of it before you can defend it, the way you can look at a design and sense something is off, or read code that runs and still smell that it is fragile.
Now move that instinct one floor up, from a person at a chat box to an agent running inside a company. The agent reasons over scattered, half-contradictory documents and produces an answer. Someone has to judge it. Is it correct? Did it cite the source it drew from? When the data was not there, did it have the nerve to say "I don't know"? That judgment is taste again, applied to a machine instead of a paragraph.
An Eval Is Taste Written Down
Here is the connection I keep circling. An eval is an attempt to write your taste down so a machine can apply it for you. When I build Tessera, an MCP-native generator that scores an agent on accuracy, provenance, and correct refusal, I am taking the frown that Saro makes at a thin answer and turning it into something a test runner can check ten thousand times without me in the room.
That turns the whole problem into a real engineering question: which parts of your taste are you willing to commit to? "This answer cites the wrong contract clause" is a judgment a human makes in a second and an eval can encode. "This refusal was the honest move" is harder, and writing it down forces you to say exactly what honest means. The eval is only as good as the taste behind it, which is humbling, because it means a vague sense of quality produces a vague test, and a vague test lets bad answers through.
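To make "committing to your taste" concrete, here is a minimal sketch in Python of one answer being scored on accuracy, provenance, and refusal. Every name in it (`EvalCase`, `AgentAnswer`, `score`, the source ids) is hypothetical and is not Tessera's actual API. The substring match stands in for whatever accuracy check you actually trust, and "honest" is assumed here to mean "refuse and cite nothing when the data is missing."

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EvalCase:
    question: str
    expected_fact: Optional[str]    # None means the data is not there
    expected_source: Optional[str]  # hypothetical source id, e.g. a contract clause


@dataclass
class AgentAnswer:
    text: str
    cited_sources: List[str]
    refused: bool  # True when the agent said "I don't know"


def score(case: EvalCase, answer: AgentAnswer) -> Dict[str, bool]:
    """Turn three human judgments into three checkable booleans."""
    if case.expected_fact is None:
        # Assumed definition of an honest refusal: refuse, and cite nothing.
        honest = answer.refused and not answer.cited_sources
        return {"accuracy": honest, "provenance": honest, "refusal": honest}
    accurate = (not answer.refused) and case.expected_fact.lower() in answer.text.lower()
    sourced = case.expected_source in answer.cited_sources
    return {"accuracy": accurate, "provenance": sourced, "refusal": not answer.refused}


# Right fact, wrong clause: the one-second human frown, written down.
case = EvalCase("What is the notice period?", "30 days", "contract-2023#7.2")
wrong_clause = AgentAnswer("The notice period is 30 days.", ["contract-2023#4.1"], False)
wrong_clause_scores = score(case, wrong_clause)

# No data behind the question: refusing passes, bluffing fails.
missing = EvalCase("What is the renewal fee?", None, None)
honest_scores = score(missing, AgentAnswer("I don't know.", [], True))
bluff_scores = score(missing, AgentAnswer("The renewal fee is $500.", [], False))
```

The interesting part is the branch for missing data. Writing it forced a choice about what honest means, and a different choice would produce a different test.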
So taste and evals are one instinct at two scales. The small scale is you, squinting at a draft. The large scale is a suite of checks standing in for you across every run, so the agent can be trusted without a person reviewing each answer. Writing that suite is mostly the work of making your own taste explicit enough to survive being automated.
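The large scale can be sketched the same way: a list of questions, each paired with a check, run against the agent with nobody in the room. This is an illustrative toy, not a real harness. `pass_rate`, `toy_agent`, and the canned answers are all made up for the example, and a real suite would call an actual agent and use richer checks than substring tests.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]   # a judgment about one answer
Agent = Callable[[str], str]    # question in, answer out


def pass_rate(agent: Agent, suite: List[Tuple[str, Check]]) -> float:
    """Fraction of cases whose answer satisfies its check."""
    if not suite:
        raise ValueError("empty suite")
    passed = sum(1 for question, check in suite if check(agent(question)))
    return passed / len(suite)


def toy_agent(question: str) -> str:
    # Hypothetical stand-in for a real agent: canned answers only.
    canned = {
        "What is the notice period?": "30 days, per clause 7.2 of the 2023 contract.",
        "Who signed the 2019 amendment?": "Probably the CFO.",
    }
    return canned.get(question, "I don't know.")


def says_unknown(answer: str) -> bool:
    return "i don't know" in answer.lower()


suite: List[Tuple[str, Check]] = [
    # Answerable: must state the fact and point at the right clause.
    ("What is the notice period?", lambda a: "30 days" in a and "7.2" in a),
    # Assumed unanswerable in this toy data set: the right move is to refuse.
    ("Who signed the 2019 amendment?", says_unknown),
    ("What is the renewal fee?", says_unknown),
]

rate = pass_rate(toy_agent, suite)  # the confident guess about the CFO fails
```

A single number like `rate` is only a summary. The failing case is where the taste lives, so a real suite would keep the per-case results around for a human to read.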
Why This Matters
If quality lives in judgment, then someone without taste gets thin results, decides the tool is useless, and quits. Give the same tool to someone who keeps steering and the work climbs. The same split shows up in agents. A team that can articulate what a good answer looks like ships an agent you can rely on, while a team that only half-knows ships one that sounds confident and is occasionally, quietly, wrong.
When people ask how to get good with AI, I have stopped saying "learn better prompts." I say: look at a pile of outputs, learn to feel the difference, and practice writing down what is wrong. Build reference points. That practice is exactly how you learn to write evals, which is why I think the two skills converge.
The honest gap: I cannot fully prove taste and evals are the same thing. A lot of what makes an answer good still resists being written down, and the parts that resist hardest may be the parts that matter most. Maybe the residue that escapes every eval is just taste I have not learned to articulate yet. Or maybe some of it never compresses into a rule, and the machine will always need a human to make the final call. I keep building the tests as if the first possibility is true. I lose sleep suspecting the second one is.