PMs Should Define What 'Good' Means for AI Features—Not Measure It
There's a version of the "empowered PM" that's quietly breaking AI teams: the PM who decides they need to own evaluations because they own the product spec. The logic feels sound. You own the outcome, so you should own the measurement. But in practice, this collapses the distinction between two very different jobs—and it makes PMs worse at both.
The argument here is specific: PMs should define what success looks like for an AI feature, including failure modes, acceptable error rates, and business constraints. Engineers and data scientists should own the eval framework, the methodology, and the iteration loop. Conflating these two responsibilities doesn't make the product better. It makes the PM a bottleneck and the evals less rigorous.
The Confusion at the Root of This Problem
When a PM says "I'm going to own evals," they usually mean one of two things:
- "I'll define what good looks like." This is PM work. It requires understanding users, business context, and what failure costs.
- "I'll design and run the evaluation framework." This is engineering and data science work. It requires statistical literacy, knowledge of eval methodologies (LLM-as-judge, human annotation pipelines, embedding-based similarity, regression test suites), and the ability to reason about measurement validity.
These are not the same job. The first requires deep product intuition. The second requires technical depth. Most PMs are strong at the first and underprepared for the second—not because they're incapable of learning, but because it's not where their leverage is.
The problem is that "evals" as a term has become a catch-all. In ML, an evaluation is a structured, statistically valid measurement of model behavior across a defined test set. It has methodology—how examples are selected, how outputs are scored, how results are aggregated. When PMs "run evals," what they typically do is look at 15-20 outputs, note the ones that feel wrong, and form an opinion. That's not an evaluation. That's a vibe check. And vibe checks create a specific, dangerous kind of false confidence.
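The difference is easiest to see in code. Here is a minimal sketch of what "structured, statistically valid measurement" means in practice: a defined test set, an explicit scoring rule, and aggregation over a sample large enough to mean something. `run_model` and `score` are hypothetical stand-ins, not any particular library's API.

```python
import random

def run_model(prompt: str) -> str:
    # Hypothetical stand-in for the model under test.
    return f"answer to: {prompt}"

def score(prompt: str, output: str) -> bool:
    # Hypothetical scoring rule: an automated check or a human label.
    # The point is that the rule is defined *before* looking at outputs.
    return len(output) > 0

def structured_eval(test_set: list[str], n_samples: int = 200) -> float:
    """Score a representative random sample of inputs and aggregate."""
    sample = random.sample(test_set, min(n_samples, len(test_set)))
    passed = sum(score(p, run_model(p)) for p in sample)
    return passed / len(sample)  # pass rate over a defined sample

# A vibe check is the same loop run over ~15 hand-picked examples,
# with no scoring rule and no aggregation: an opinion, not a measurement.
test_set = [f"user query {i}" for i in range(1000)]
pass_rate = structured_eval(test_set)
```

The methodology lives in the three choices the sketch makes explicit: how `test_set` is sampled, what `score` counts as success, and how results are aggregated.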

How PM-Owned Evals Create False Confidence
Here's what actually happens when a PM takes ownership of running evals on an AI feature:
They look at outputs. Some are impressive. Some are bad. They focus on the impressive ones because those validate the direction they've already committed to. They flag the bad ones, but since they don't have a framework for understanding why the model failed or how frequently it fails at that type of input, the feedback they give engineers is anecdotal: "This output felt off" or "The tone here was wrong."
Engineers then fix the specific examples the PM flagged. The PM re-reviews. The fixed examples look better. The PM approves. The feature ships with failure modes that were never systematically tested because the eval process was driven by examples that happened to be visible, not a representative sample of real inputs.
This is cherry-picking with extra steps. The PM feels like they did rigorous quality work. The engineer feels like they satisfied the feedback. And the user encounters failure modes that neither of them saw coming.
There's also a speed problem. When PMs own eval execution, they become the critical path. Engineers finish a model update, then wait for the PM to review outputs. The PM has other priorities—roadmap work, stakeholder updates, user research. Reviews slip by days. Iteration slows. In a domain where model quality improves through rapid, tight feedback loops, this is a real cost.
What PMs Should Actually Specify
The PM's job in an AI feature isn't to measure quality—it's to define what quality means in terms that can be measured. That's a distinct skill, and it's genuinely hard to do well.
Concretely, this means PMs should produce:
- Success thresholds: Not "the output should be accurate" but "the feature should produce outputs that users rate as helpful ≥80% of the time in usability testing, and factual errors should occur in fewer than 5% of responses on our core use case inputs."
- User-facing failure modes, ranked by severity: What are the worst things the model can do? Hallucinating a competitor's product name is a different severity than hallucinating a medical dosage. PMs should define this hierarchy, because it determines how the engineer prioritizes eval coverage.
- Business constraints: Latency requirements, cost-per-query limits, compliance boundaries. These constrain what "good enough" means technically.
- Edge cases that matter most to the user: Not a comprehensive test suite—that's engineering work—but the scenarios that represent the highest-stakes or highest-frequency user situations. These become the inputs that the engineer's eval suite must cover.

Notice what's absent from this list: eval methodology, test set construction, scoring rubrics, inter-annotator agreement, LLM-as-judge prompt design. Those are engineering decisions. The PM's spec should be the input to those decisions, not a replacement for them.
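One way to keep that boundary crisp is for the PM's spec to be structured data that the engineer's eval framework reads directly. This is a sketch only; the metric names and numbers are illustrative, not a standard schema:

```python
# Hypothetical acceptance-criteria document. The PM defines the targets;
# how each metric is actually measured is the engineer's decision.
acceptance_criteria = {
    "success_thresholds": {
        "helpfulness_rating": {"min": 0.80, "measured_in": "usability testing"},
        "factual_error_rate": {"max": 0.05, "measured_in": "core use case inputs"},
    },
    "failure_modes": [  # ranked by severity, worst first
        {"name": "hallucinated medical dosage", "severity": "critical"},
        {"name": "hallucinated competitor product name", "severity": "high"},
        {"name": "off-brand tone", "severity": "low"},
    ],
    "constraints": {
        "p95_latency_ms": 1500,
        "max_cost_per_query_usd": 0.02,
    },
    "must_cover_inputs": [
        "highest-frequency core use case queries",
        "high-stakes edge cases named by the PM",
    ],
}
```

Everything in this document is a product decision; nothing in it dictates whether the engineer uses LLM-as-judge, human annotation, or a regression suite to measure it.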
The Collaboration Model That Actually Works
The right split looks like this:
PM owns:
- Writing acceptance criteria before engineering starts (not after)
- Defining the failure modes that matter to users and the business
- Setting the threshold for "good enough to ship"
- Reviewing eval results against acceptance criteria—not the eval methodology itself
- Deciding whether the product meets the bar, given what the evals show
Engineer/data scientist owns:
- Designing the eval framework and test set
- Selecting eval methodology (automated scoring, human annotation, regression suites)
- Running evals and iterating on model behavior
- Surfacing results in a format the PM can reason about
- Flagging when the acceptance criteria are technically unmeasurable or need refinement
Both iterate on:
- Whether the acceptance criteria are the right ones (sometimes engineering finds failure modes the PM didn't anticipate)
- Whether the eval results actually answer the product question the PM cares about
- The go/no-go decision before launch
| PM Defines, Engineer Measures | PM Owns End-to-End Evals |
| --- | --- |
| Evals are statistically valid and representative | Review cycles slow iteration to the PM's availability |
| Engineers can iterate quickly without the PM as a bottleneck | Cherry-picked examples create false confidence |
| PM focuses on user impact, not methodology | PM spends time on technical work outside their leverage |
| Failure modes are caught systematically, not anecdotally | Engineers optimize for what the PM notices, not what matters |
| Clear ownership means clear accountability | Methodology gaps go undetected until production |
The key mechanism here is the acceptance criteria document. If the PM writes clear, specific, measurable criteria before the engineer builds the eval framework, the engineer has a target to design toward. If the PM writes criteria after seeing the engineer's eval results, they're rationalizing whatever was built—which is exactly the rubber-stamping problem this split is meant to prevent.
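That ordering can be enforced mechanically: the criteria exist first, and the ship decision is a check of measured results against them. A hedged sketch, with hypothetical metric names and illustrative numbers:

```python
# Hypothetical ship gate. The PM wrote CRITERIA before the eval framework
# existed; the engineer's pipeline produces `results`; the gate compares.
CRITERIA = {
    "helpfulness_rating": {"min": 0.80},
    "factual_error_rate": {"max": 0.05},
}

def meets_bar(results: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (go/no-go, list of criteria the results failed)."""
    failures = []
    for metric, bounds in CRITERIA.items():
        value = results[metric]
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{metric}={value:.2f} below min {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{metric}={value:.2f} above max {bounds['max']}")
    return (not failures, failures)

# Helpfulness clears the bar, but the error rate does not: no-go.
ok, failures = meets_bar({"helpfulness_rating": 0.84, "factual_error_rate": 0.07})
```

The PM's review is of `failures`, not of the pipeline that produced `results`; that is the whole division of labor in one function.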

Why This Distinction Is Harder Than It Sounds
There's a real tension in this model: PMs who don't understand eval methodology at all can't write acceptance criteria that are actually measurable. If you specify "the model should never hallucinate" as a success threshold, you've written a criterion that's technically unmeasurable (hallucination detection is itself an unsolved problem) and practically useless.
So the right answer isn't "PMs should stay ignorant of how evals work." It's "PMs should understand eval methodology well enough to write measurable criteria, without trying to own the execution."
That's a calibration question. A PM building AI features needs enough technical literacy to know:
- What kinds of outputs can be automatically scored vs. require human judgment
- Why a test set needs to be representative of real user inputs, not cherry-picked examples
- What a precision/recall tradeoff means for their specific feature
- Why "the model got this right" in a demo doesn't tell you anything about production performance
This is similar to how a PM who ships a data dashboard doesn't need to write SQL, but does need to understand whether the metric being tracked is the right one. The judgment layer is PM work. The execution layer is engineering work.
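The precision/recall item on that literacy list can be grounded in a few lines. Suppose, as a hypothetical, the feature flags risky responses: precision is how often a flag is correct, recall is how many risky responses get flagged, and tuning one up typically pushes the other down. The labels below are invented for illustration:

```python
# truth[i]: whether response i was actually risky (illustrative labels).
truth = [True, True, True, False, False, False, False, False]

def precision_recall(truth: list[bool], flagged: list[bool]) -> tuple[float, float]:
    tp = sum(t and f for t, f in zip(truth, flagged))        # correctly flagged
    fp = sum((not t) and f for t, f in zip(truth, flagged))  # flagged but safe
    fn = sum(t and (not f) for t, f in zip(truth, flagged))  # risky but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Aggressive flagger: catches every risky response, but over-flags.
aggressive = [True, True, True, True, False, False, False, False]
# Conservative flagger: never wrong when it flags, but misses most.
conservative = [True, False, False, False, False, False, False, False]

p_a, r_a = precision_recall(truth, aggressive)    # precision 0.75, recall 1.0
p_c, r_c = precision_recall(truth, conservative)  # precision 1.0, recall ~0.33
```

Which point on that tradeoff is acceptable is exactly the kind of judgment the PM's acceptance criteria should encode.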
The Real Cost of Getting This Wrong
When PMs try to own evals without the technical chops to do it rigorously, two failure modes emerge—and they're both bad:
Rubber-stamping: The PM doesn't really understand what the engineer built, so they approve it because the outputs look reasonable on the examples they checked. Problems surface in production. The PM is blamed for approving something they couldn't actually evaluate.
Bottlenecking: The PM takes their quality ownership seriously and reviews every model update carefully. Iteration slows to the PM's review cycle. The engineering team loses momentum. The feature takes twice as long to reach acceptable quality because the feedback loop is broken.
Either way, the product suffers. The fix isn't to make PMs better at running evals. It's to redirect PM energy toward the thing that actually requires their specific expertise: defining what the product needs to do for users, in terms precise enough that engineers can build a framework to test it.
Owning the definition of success is harder than it sounds. Most PMs underinvest in writing specific, measurable acceptance criteria because it's easier to review outputs and give impressionistic feedback. But that's exactly where PM leverage is highest in AI development—not in the eval pipeline, but in the clarity of the target the pipeline is designed to hit.